intro
UTF-8 is one of the most popular encoding methods within databases. However, not all UTF-8 encoding methods are made the same – UTF-8 within MySQL isn’t really UTF-8 either. Why? Figure everything out here!
Data encoding has been around for as long as databases have been around. Properly encoding your data means that your data is going to be shown in a specific way – such an approach is exceptionally popular for those working with more “exotic” languages (think Korean, Chinese, Japanese, etc.), but it’s important for those working with data derived from English data sets as well. One of the reasons it’s important for english-speaking people is because encoding data differently yields different results in terms of storage as well – if one encoding method is in use, data may take up 1 byte of space, however, if another encoding method is being used, data may take up entire 4 bytes. If that sounds confusing, try to think about everything this way – if we insert a character weighing 3 bytes into a column that’s encoded with a method that can only take 2 bytes, our data will be shown as “???” – see the problem? That’s one of the problems of UTF-8 as well.
UTF-8 Explained
UTF-8, simply put, is an encoding method that translates any unicode-based character into a string. UTF-8 is one of the most frequently used encoding systems for Unicode and it’s everywhere – whatever you can imagine, it’s probably using UTF-8. Database management systems (Oracle, PostgreSQL, TimescaleDB, and all flavors of MySQL) are no exception to this rule, either – starting from MySQL 8.0, utf8mb4 is the default character set for all data existing within our MySQL infrastructure as well.
UTF stands for “Unicode Translation Format”, and it means just that – UTF translates unicode characters to matching binary strings and vice versa.
UTF-8 in MySQL
For those who are into databases and especially into MySQL, though, UTF-8 is not always UTF-8. The default character set starting from MySQL 8.0 is utf8mb4 instead of utf8, and that’s for a good reason – utf8mb4 supports a maximum of four bytes per character, while the original utf8 supports only up to three. Since that was the primary reason of some data showing incorrectly for some users of MySQL, the team over at MySQL was quick to adjust and change how things went along.
For some, utf8mb4 might seem a little confusing – the number at the end of the character set, though, simply refers to the number of supported bytes within data that can be shown to the user. The original utf-8, as already noted, only was able to support three bytes per character.
Four bytes per character means support for BMP (most commonly used) characters and supplementary characters as well – variations of utf8 can store different things:
Simply put, when utf8mb4 is in use, users can feel free to use symbols that require a lot of bytes per character – such symbols include newly introduced emojis, smileys like ☺, ☹, 😐 and so on. This wasn’t always that way, though – there were days when users of MySQL quickly ran into problems as far as their data encoding was concerned, and that’s precisely why utf8mb4 was invented.
Why utf8mb4?
The main reason why MySQL chose to use utf8mb4 instead of utf8 was precisely because of the reason mentioned above – utf8mb4 supports up to 4 bytes per character which means that it can support the use of emojis and other things – it offers full Unicode support. Aside from that, full Unicode support also means that certain characters won’t break the database itself. If utf8mb4 wouldn’t be in use, when characters requiring 3 or 4 bytes per character would be stored, our database would actually respond with an error like so:
1
Incorrect string value: ‘\x77\xD0’ for column ‘column_name_here’ at row 1
Types of utf8mb4 in MySQL
Various database management systems also have various types of utf8mb4 that can be used. MySQL alone has two of them and these are as follows:
Also, it’s worth noting that when any of those character sets would be in use, MySQL would store data without regard to case sensitivity: “ci” stands for “Case Insensitive.”
If you are looking for an easy and powerful SQL client and database manager, then you've got to try DbVisualizer. It connects to nearly any database.
Since MySQL’s UTF-8 is not the real UTF-8 for reasons described above, both utf8mb3 and utf8mb4 character sets are a frequent point of confusion for both developers and database administrators alike. That’s mostly because many users use incompatible collations for their character sets, so if you elect to use utf8mb3 or utf8mb4 instead of utf8 in MySQL (or if we’re already using this collation), we should also make sure to use an appropriate character set (the same character set.) Don’t fret – in most cases, it simply means that you should try your best to create tables with the same character set and collation (see example below):
For those who are wondering, creating a database with a specific character set and collation will also mean that all tables within that database will be created using that collation by default – in other words, there’s no need to specify the CHARACTER SET
and COLLATE
variables when creating a table if they don’t differ from the ones set within the database itself. We can run a SHOW TABLE STATUS
query to observe the collation of our tables like so:
When the utf8mb4 collation is in use, we can finally feel free to store all characters within our database without any problems. Easy, isn‘t it?
Going Beyond Encoding
By now, you should have a very good understanding of what UTF-8 is and how it works internally. However, if you really want to be a professional DBA, UTF-8 won’t do much – you already know that you have to cast your net much deeper than that.
Aside from character sets and collations, familiarize yourself with:
It’s not necessary to be an expert in all of those areas, however, knowing your way around them will take you futher than you could ever imagine – if you take database management seriously, you should also consider using database clients like DbVisualizer to help you in your everyday database management endeavors: since DbVisualizer has a long history of being the preferred database management tool for the world’s leading data professionals, is used by notable companies (Twitter, Tesla, and even NASA), and offers many powerful features ranging from a SQL editor to offering military-grade data encryption via SSH, letting you easily explore and visualize your data for an easier workflow, and comes with an in-built visual query builder, we’re confident your company will like it. Matter of fact, why not try it for free for 30 days? Everything’s on us! Just click here and start your journey!
Summary
In this blog, we’ve walked you through MySQL’s implementation of UTF-8. Contrary to a popular belief, UTF-8 within MySQL is not really the “real” UTF-8 since it doesn’t support the right amount of bytes per character – MySQL has fixed its ways starting from MySQL 8, however, if you don’t use MySQL 8 and still want full UTF-8 support, please consider using utf8mb4 instead of plain utf8.
Also, consider using database management and SQL client tools to make your work with databases easier – since DbVisualizer is the preferred database tool for world’s leading companies, we recommend you grab an evaluation trial and start using it right away! Come back to our blog for more news around the database space, and until next time.
FAQs
What Is UTF-8?
UTF-8 is one of the most popular encoding methods within relational database management systems and beyond – it translates any Unicode-based character into a string.
What Encoding Should I Use In MySQL? Why?
You should use utf8mb4 since it can support up to 4 bytes per character (utf8mb3 supports up to 3, and the original utf8 encoding available within MySQL doesn’t even support 3.)