
Dangerous Big Data - Big Data Pitfalls to Avoid


Running big data on your SQL database instances? Read this blog to learn how to deal with big data sets inside your database instances properly.

Tools used in the tutorial
Tool: DbVisualizer – Top rated database management tool and SQL client

The performance of databases on bigger data sets is something that’s always of concern to developers. As our data gets bigger, our databases have the increasingly hard task of keeping up with performance requirements. Unfortunately, a dangerous big data scenario can happen.

In this blog, we will walk you through things you should avoid when dealing with big data sets in your database instances.

Getting Started

Here’s a quick set of questions for you before getting into the topic of this guide:

Question: Is this your first time working with bigger data sets in database instances?
Answer: If you’re a seasoned professional who deals with bigger data sets day in and day out, you’ll be able to apply some of your own shortcuts to the advice in this article. If not, no worries – we’ll walk you through everything you need to know, but it will take some time for you to ramp up.

Question: How many rows are considered “big data” (https://en.wikipedia.org/wiki/Big_data)? How do I know if I’m in the territory of dangerous big data?
Answer: Big data means different things to different people – the number varies depending on who you ask, but we think that anything over 100 million rows is a decent chunk of data that falls into that category. The number of rows you have will directly impact your decision-making. Dangerous big data territory is the amount of data that starts to terrify you because you’ve never worked with that much in the past.

Question: What kind of DBMS are you using?
Answer: This is one of the main questions you need an answer to – different database management systems work in different ways and solve different problems, so it’s important to do your research before committing to any one database management system.

Question: Do you have experience dealing with databases? Bigger data sets?
Answer: Some of your experience will definitely carry over to situations involving big data – in many cases, the simplest solutions work best.

After you have the answers to these questions, you will be able to move further and explore the landscape of your database. Consider also checking out our guide on how to deal with a database with billions of records.

What DBMS to Choose?

The database management system powering your database is probably the most important decision you have to make. Many experienced DBAs (database administrators) derive answers from their professional experience, but for others, the choice may not be so obvious. Consider this:

  • Non-relational database management systems, such as MongoDB, avoid the row-column storage model popular with relational database management systems. In such databases, there are no tables, rows, primary or foreign keys. These database management systems can store data in one of several ways while also making use of BASE, meaning our dangerous big data sets adhere to the Basically Available, Soft state, Eventually Consistent principles:
    • Document stores – such non-relational DBMSes store data in JSON-like documents and let developers run complex queries against them.
    • Columnar data stores – such an approach organizes data into columns rather than rows.
    • Key-value stores – as the name suggests, data is stored as “key-value” pairs.
    • Graph databases – a storage solution built for data that has multiple links to interconnected data sets.
  • Relational database management systems are a good fit for structured data with well-defined relationships between data sets. They store data sets that are “related” to each other, and they make dangerous big data a little less dangerous because they’re known to support the ACID model, meaning that even if our electricity goes out mid-query, our data stays consistent (see the short transaction sketch right after this list).
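
To make the ACID point a little more concrete, here’s a minimal sketch of a transaction, assuming MySQL with InnoDB and a hypothetical accounts table – either both updates are persisted, or neither is, even if the power goes out halfway through.

-- Hypothetical accounts table, MySQL/InnoDB syntax.
START TRANSACTION;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

-- Only COMMIT makes both changes durable; a crash before this point rolls
-- everything back, so the data never ends up half-updated.
COMMIT;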

After you know this, choose the model most suitable for your specific use case, then proceed to choosing the database management system itself. Be aware that there are plenty of data warehousing solutions like AWS Redshift and similar, but the underlying DBMS housing your data is still one of the most important things for you to decide on.

Configuring the Database

First off, databases must be configured properly. That’s true for any amount of data – database configuration can take us a long way, regardless of whether we’re in dangerous big data territory or not. Here’s what to do:

  1. Select your server properly: Many of you will ask whether to pick a VPS (Virtual Private Server) or a dedicated server, and the answer is quite simple – if you’re just starting out, a VPS will do, but as your data grows, you’ll definitely want to switch to a dedicated server.
  2. Read up on the documentation concerning your specific database management system of choice (MySQL Server has my.cnf, PostgreSQL has postgresql.conf, MongoDB has mongod.conf, and so on). These configuration files can make or break your database instance, so make sure to be aware of the parameters that you can configure – a short sketch of inspecting and adjusting one such parameter follows this list. The better your database configuration, the faster your queries will be.
  3. Draw on your own experience and learn from others: Perhaps you’ve built something on big data sets in the past? How did that go? If not, read blogs, go to conferences, and watch videos related to working with bigger data sets inside of database management systems – those will definitely take you a long way.
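
As an illustration of point 2, here’s a hedged sketch of checking and resizing one well-known parameter, MySQL’s InnoDB buffer pool. The values are purely examples – size the buffer pool against the RAM your own server actually has to spare.

-- See the current buffer pool size (in bytes):
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Example only: on a server with ~16 GB of RAM dedicated to MySQL, ~10 GB
-- might be reasonable. The variable is dynamic in MySQL 5.7+ and can be
-- resized online:
SET GLOBAL innodb_buffer_pool_size = 10 * 1024 * 1024 * 1024;

-- Persist the change in my.cnf as well, otherwise it is lost on restart:
--   [mysqld]
--   innodb_buffer_pool_size = 10G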

These three steps are the main things you need to do to help your database avoid big data dangers – of course, you should also be aware of your server’s capabilities when configuring your database, but it doesn’t get much more complex than that. Instead, many dangerous big data pitfalls come from situations surrounding your database or data sets. Here’s how to avoid them!

Dangerous Big Data – Pitfalls to Avoid

Some of the most frequent pitfalls in the dangerous big data world are as follows:

  1. Misconfiguration of database instances
  2. A setup unfit for bigger data sets
  3. Slow data import speed
  4. Improperly formed search queries
  5. Improper ways of analyzing big data sets

These problems are where 80% of big data-related issues come from, and they can manifest in many forms including, but not limited to:

  1. Queries timing out
  2. Actions involving certain queries (e.g., INSERTs, etc.) being unreasonably slow
  3. Running out of memory or disk space
  4. Issues with data normalization
  5. Schrödinger’s tables in a database instance

These problems aren’t new to anyone; however, when bigger data sets are involved, these pitfalls must be properly accounted for to avoid disaster.

How to Avoid Pitfalls?

Here’s how to overcome these big data problems:

  1. To prevent your queries from timing out, ensure that you:
    1. Only search through data that is absolutely necessary (i.e. use “SELECT column” instead of “SELECT *”)
    2. Normalize all of your databases and tables within them.
    3. Scan through as little data as possible – for that, use partitions and indexes, and also refer to the first point above (a short SQL sketch of this advice follows this list).
    4. Use a proper storage engine for your day-to-day operations. To choose a proper storage engine, refer to the documentation of the DBMS of your choice.
    5. Configure your database instances properly and make use of the features provided by your storage engines (e.g., MySQL’s InnoDB has a buffer pool whose size should generally be set to around 60% of available RAM for decent performance, with log files at around a quarter of that size). Other database management systems have very similar approaches.
  2. To prevent queries from being unreasonably slow, make use of the advice in point 1, and also keep in mind the way your database management system works: it’s wise to dedicate around 60% of the available RAM in your system to the structure that INSERT queries interact with (for MySQL, that’s the buffer pool).
  3. To prevent running out of memory or disk space, know how your DBMS and the functions within it work internally by reading the docs and learning from your own experience (e.g., many ALTER queries make a copy of the table on disk), then use your Linux knowledge to keep an eye on your server.
  4. Make sure to normalize all of your database instances properly – there are several normal forms (1NF, 2NF, 3NF, BCNF, etc.) and each of them has its own pros and cons, so weigh them before you normalize (a brief before-and-after sketch follows at the end of this section).
  5. Make use of ACID capabilities. Databases based on the BASE principles (non-relational database management systems) often don’t have a very good way of “saving” data after a power loss or similar failures, which are among the primary causes of Schrödinger’s tables (tables that exist and don’t exist at the same time).
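
Here’s a short SQL sketch of the query-side advice above (select only what you need, index what you filter on, partition what you scan), using a hypothetical orders table and MySQL syntax – adapt the column names and partition ranges to your own schema.

-- Only pull the columns you need instead of SELECT *:
SELECT order_id, total
FROM orders
WHERE customer_id = 42;

-- Index the columns you filter on so queries scan as little data as possible:
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Range-partitioning a large table by year means queries that filter on the
-- partitioning column only touch the relevant partitions:
CREATE TABLE orders_archive (
    order_id    BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    total       DECIMAL(10, 2) NOT NULL,
    created_at  DATE NOT NULL,
    PRIMARY KEY (order_id, created_at)
)
PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);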

These tips will be a good base when starting off – however, you must keep in mind that in many cases, advice like that won’t be enough, and you will need to learn from the real world too.
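
To make the normalization advice a bit more concrete, here’s a hedged before-and-after sketch with hypothetical table names: the flat table repeats customer details on every order row, while the normalized version stores them once and references them by key.

-- Before: customer details duplicated on every order row, which wastes space
-- and invites update anomalies as the table grows.
CREATE TABLE orders_flat (
    order_id       BIGINT PRIMARY KEY,
    customer_name  VARCHAR(255),
    customer_email VARCHAR(255),
    total          DECIMAL(10, 2)
);

-- After: customer attributes live in one place, and orders reference them by key.
CREATE TABLE customers (
    customer_id BIGINT PRIMARY KEY,
    name        VARCHAR(255),
    email       VARCHAR(255)
);

CREATE TABLE orders_normalized (
    order_id    BIGINT PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    total       DECIMAL(10, 2),
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);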

Real-world Problems

Here are some problems we’ve observed when working with bigger data sets over the years:

  1. Be very careful with indexes: SQL indexes are a very powerful friend in the big data world; however, in many database systems they have limitations, including limits on the length of the indexed column. These problems are exacerbated the more data you have, so be aware of the length of your columns and the data contained within them before indexing (a short sketch of one workaround follows this list).
  2. Partitions are tables, too: For some, this may sound obvious, but keeping such advice in mind is crucial when your data gets bigger and bigger. There are multiple types of partitioning too, so make sure to choose the type of partitioning your databases are most happy with in your specific use case. And keep in mind that partitions occupy space!
  3. Don’t over-select: As we’ve already told you, running queries like “SELECT column” instead of “SELECT *” is basic advice, but you would be surprised how many use cases still ignore it.
  4. Your server has limits: Before optimizing your database instance, be intimately aware of what your server can and cannot do – don’t push it too far and do everything with your data (and databases) in mind.
  5. ACID is important (sometimes): ACID is a very good feature allowing us to not worry about data corruption or similar issues if our server crashes in the middle of a SQL query, but did you know that certain database management systems, including MySQL, let you relax ACID guarantees for speed? That may be very beneficial for those running archival-based systems (one common knob is shown in the sketch after this list).
  6. Choose a storage engine properly: A storage engine is always an essential part of your database’s lifecycle. Database engines are made to solve problems – make sure to choose the one that solves the most problems for you.
  7. Conferences help: No, seriously. Once you hear a speech related to “How I built a massive database using [the database you’re working with] and what I learned,” it’s time to go and learn. In many cases, speakers share insightful detail regarding their work and the data aspects related to it.
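
Two MySQL-flavored sketches of the index-length and ACID points above, with hypothetical table and column names – treat them as examples to adapt, not defaults to copy.

-- 1) Index length limits: InnoDB caps index key length (3072 bytes with the
--    default DYNAMIC row format), so very long VARCHAR/TEXT columns usually
--    need a prefix index instead of a full-column index:
CREATE INDEX idx_events_payload ON events (payload(191));

-- 2) Trading durability for speed on archival-style workloads: with the value
--    2, the redo log is flushed to disk about once per second instead of on
--    every commit, which speeds up heavy INSERT streams at the risk of losing
--    roughly the last second of transactions on a crash.
SET GLOBAL innodb_flush_log_at_trx_commit = 2;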

And last but not least, also consider SQL clients. Seriously – you have millions (or even billions) of rows in your database instance, and you’re not using one?! Madness.

SQL clients like the one provided by DbVisualizer connect to all of the most popular databases. Stop working with your data as if it were a spreadsheet and explore its visual query builder, extensive visualization capabilities, and many other features. Try it today for free!

Summary

You’ve read this blog in its entirety, and that means you should now put your knowledge to the test – make sure you’ve configured your DBMS properly, and don’t fall into any of the dangerous big data pitfalls mentioned in this article. Now that you’re aware of the main things to avoid when working with bigger data sets, why not read more of our blogs for solutions to other problems? Come back to our blog later on, and until next time.

About the author
Lukas Vileikis
Lukas Vileikis is an ethical hacker and a frequent conference speaker. He runs one of the biggest & fastest data breach search engines in the world - BreachDirectory.com, frequently speaks at conferences and blogs in multiple places including his blog over at lukasvileikis.com.
