intro

Your data is dirty! What associations do these words give you? Probably not the warmest ones; in this blog, we’ll help you find out what dirty data is and how exactly it pollutes your precious database.

Tools used in the tutorial

DBVISUALIZER

TOP RATED DATABASE MANAGEMENT TOOL AND SQL CLIENT

DOWNLOAD

There is a lot of data that is traveling through this world. Every day, billions of consumers use social media networks, university and gaming platforms, attend conferences or workshops, and read books. Now that you’re reading this blog, each of these actions takes a toll on databases and the data they contain!

While we constantly walk you through the secrets of MySQL, PostgreSQL, SQL Server, and other database management systems (you do read the blog, don’t you?), we never touched upon another sensitive topic: dirty data in your database.

What Is Dirty Data?

Dirty data, as the name suggests, is data that has something wrong with it. “Something wrong” can have a variety of meanings, but in this case, we’d be referring to inconsistent, incomplete, or outdated information.

Think of it like this: if you have a use case necessitating you to select data and a basic SELECT query returns duplicate records with misspelled entries and missing values, what would you do? Probably curse the DBA, yes?

The DBA, however, isn’t always the one at fault: as dirty data often refers to outdated information or even incorrectly defined indexes or partitions (depending on the context), it isn’t right to always blame them.

Instead, in most cases, developers are the ones at fault. Think about it: the DBA is most likely the one fixing issues, and developers are working on the application every single day. They may be inserting, updating, deleting, or selecting information from the database management system, and as they do so, they may make mistakes that lead to data pollution and dirty data. Dirty data that is then noticed by you or someone else.

How Dirty Data Pollutes Your Database

From the outset, dirty data is pretty easy to grasp and understand: it refers to data that isn’t “right” - isn’t consistent, complete, or doesn’t adhere to specific standards. Take a look at this data set — is this data dirty?

Example of dirty data Inside MariaDB in DbVisualizer

“Hard to say”, I hear you saying. Let’s dig a little deeper by running a GROUP BY SQL query like so:

Copy

        
        
                            
                    1
                    SELECT email, COUNT(*) AS entrycount
                
                            
                    2
                    FROM users
                
                            
                    3
                    GROUP BY email HAVING COUNT(*) > 1;

Launch it on a visual database client like DbVisualizer, and assume the result is:

The Amount of Duplicate Emails in a Users Table

Now we can see that on average, we have about 75,000 duplicate emails in a table. Is this data dirty? If this table is a demo table demonstrating a function making use of a lot of duplicate entries, perhaps not; otherwise, certainly yes. In other words, whether our data is considered dirty or not also depends on our specific use case.

Solutions to Dirty Data Inside Your Database

There are a couple of things you can do to alleviate the burden of dirty data inside your database management system:

Consider building and running a cronjob that tracks the changes to your data after every insert, update, or delete operation. Changed data could be put into another table (without being accessible in production) that is then inspected to ensure adherence to specifications and other things.
Be wary of the data you insert or update: by keeping track of what you insert or update, you will be able to prevent many things from going wrong.
Before inserting any data into your production database, insert it into your local storage, then inspect it, and if everything looks good, move it up to a production level.

These tips will help you alleviate the problems caused by dirty data on a database level. To take care of dirty data in your application, make sure to be wary of what data is inserted, updated, selected, and deleted within your application.

Summary

Dirty data is something none of us want to see or interact with because of three reasons:

It takes up unnecessary space on our hard drives leaving less space for the data we need.
Such data may be inconsistent, incorrect, incomplete, or outdated, meaning that it’s risky or impossible to make rational decisions by looking at it.
Dirty data may also refer to duplicate records inside of the database management system or materialize itself as NULL values where data (rows) are expected.

This blog has walked you through how to see whether the data in your database is dirty and what to do if it is: follow us on social media if this blog has helped you, and until next time.

FAQ

What is dirty data?

Dirty data refers to a subset of data that is inconsistent, incomplete, duplicated, outdated, or otherwise “dirty.” What dirty data means in your specific use case has to do with your application and database.

How to deal with dirty data in a database?

To deal with dirty data in a database, first identify it (identify the rows you would consider “dirty”), then update or delete them as necessary.

What things should I keep in mind when optimizing SQL queries and cleaning up my database?

When cleaning your database from dirty data, always keep in mind that it’s important to be well aware of the server your database and application are built on. With that said, always be mindful of the settings in your configuration file and make sure the values in your configuration file don’t clash with the values of your server.

Also, always keep your knowledge up to date by reading database blogs and books on database performance, and you should be good to go.

How Dirty Data Pollutes Your Database

What Is Dirty Data?

How Dirty Data Pollutes Your Database

Solutions to Dirty Data Inside Your Database

Summary

FAQ

What is dirty data?

How to deal with dirty data in a database?

What things should I keep in mind when optimizing SQL queries and cleaning up my database?

Parsing and SQL Data Types: A Complete Guide

Best SQL Clients for Developers: Complete List

Best Database Tools for Business Users: Complete List

Best Database Tools for Analysts: Complete List

The HEAP Data Structure and in-Memory Data Explained

SQL Boolean Type: How to Use It in All Major Relational Databases

Best Database Tools for Developers: Ultimate List

Implementing Version Control for Your Database

What Happens When You Use the UNION and DISTINCT SQL Clauses Together?

OpenSearch vs Elasticsearch: Is OpenSeach an Open-Source Alternative to ElasticSearch?

Product

Industries

Legal

Company

Contact