intro
Your data is dirty! What associations do these words give you? Probably not the warmest ones; in this blog, we’ll help you find out what dirty data is and how exactly it pollutes your precious database.
There is a lot of data that is traveling through this world. Every day, billions of consumers use social media networks, university and gaming platforms, attend conferences or workshops, and read books. Now that you’re reading this blog, each of these actions takes a toll on databases and the data they contain!
While we constantly walk you through the secrets of MySQL, PostgreSQL, SQL Server, and other database management systems (you do read the blog, don’t you?), we never touched upon another sensitive topic: dirty data in your database.
What Is Dirty Data?
Dirty data, as the name suggests, is data that has something wrong with it. “Something wrong” can have a variety of meanings, but in this case, we’d be referring to inconsistent, incomplete, or outdated information.
Think of it like this: if you have a use case necessitating you to select data and a basic SELECT
query returns duplicate records with misspelled entries and missing values, what would you do? Probably curse the DBA, yes?
The DBA, however, isn’t always the one at fault: as dirty data often refers to outdated information or even incorrectly defined indexes or partitions (depending on the context), it isn’t right to always blame them.
Instead, in most cases, developers are the ones at fault. Think about it: the DBA is most likely the one fixing issues, and developers are working on the application every single day. They may be inserting, updating, deleting, or selecting information from the database management system, and as they do so, they may make mistakes that lead to data pollution and dirty data. Dirty data that is then noticed by you or someone else.
How Dirty Data Pollutes Your Database
From the outset, dirty data is pretty easy to grasp and understand: it refers to data that isn’t “right” - isn’t consistent, complete, or doesn’t adhere to specific standards. Take a look at this data set — is this data dirty?

“Hard to say”, I hear you saying. Let’s dig a little deeper by running a GROUP BY
SQL query like so:
1
SELECT email, COUNT(*) AS entrycount
2
FROM users
3
GROUP BY email HAVING COUNT(*) > 1;
Launch it on a visual database client like DbVisualizer, and assume the result is:

Now we can see that on average, we have about 75,000 duplicate emails in a table. Is this data dirty? If this table is a demo table demonstrating a function making use of a lot of duplicate entries, perhaps not; otherwise, certainly yes. In other words, whether our data is considered dirty or not also depends on our specific use case.
Solutions to Dirty Data Inside Your Database
There are a couple of things you can do to alleviate the burden of dirty data inside your database management system:
These tips will help you alleviate the problems caused by dirty data on a database level. To take care of dirty data in your application, make sure to be wary of what data is inserted, updated, selected, and deleted within your application.
Summary
Dirty data is something none of us want to see or interact with because of three reasons:
This blog has walked you through how to see whether the data in your database is dirty and what to do if it is: follow us on social media if this blog has helped you, and until next time.
FAQ
What is dirty data?
Dirty data refers to a subset of data that is inconsistent, incomplete, duplicated, outdated, or otherwise “dirty.” What dirty data means in your specific use case has to do with your application and database.
How to deal with dirty data in a database?
To deal with dirty data in a database, first identify it (identify the rows you would consider “dirty”), then update or delete them as necessary.
What things should I keep in mind when optimizing SQL queries and cleaning up my database?
When cleaning your database from dirty data, always keep in mind that it’s important to be well aware of the server your database and application are built on. With that said, always be mindful of the settings in your configuration file and make sure the values in your configuration file don’t clash with the values of your server.
Also, always keep your knowledge up to date by reading database blogs and books on database performance, and you should be good to go.