MySQL Çorbası: UTF-8 Support

Introduction

utf8 used to be the usual choice of character encoding in MySQL. As of MySQL 8.0, however, the default is utf8mb4.

In short, when creating a table with CREATE TABLE, use utf8mb4 as the CHARACTER SET and utf8mb4_general_ci as the COLLATE value.

utf8 vs. utf8mb4

The explanation is as follows:

The core reason for the separation of utf8 and utf8mb4 is that MySQL's "utf8" is different from proper UTF-8 encoding. That's the case because MySQL's "utf8" doesn't offer full Unicode support, which can lead to data loss or even security issues. Its failure to fully support Unicode is the real kicker - the UTF-8 encoding needs up to four bytes per character, while the "utf8" encoding offered by MySQL only supports three. See the issue on that front? In other words, if we want to store smilies (emoji), we cannot do it - it's not that MySQL will store them in a format of "???" or similar, but it won't store them altogether and will respond with an error message like the following:
Incorrect string value: '\x77\xD0' for column 'demo_column' at row 1

With this error message, MySQL is saying "well, I don't recognize the characters that this smiley is made out of. Sorry, nothing I can do here" - at this point, you might be wondering what is being done to overcome such a problem.
...
That workaround is called "utf8mb4". utf8mb4 is pretty much the same as its older counterpart - utf8 - it's just that the encoding uses one to four bytes per character which essentially means that it's able to support a wider variety of symbols and characters.
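
The difference described above can be tried out with a small sketch. The table names demo_utf8 and demo_utf8mb4 are hypothetical, and the failure of the first INSERT assumes strict SQL mode; in non-strict mode MySQL may only raise a warning and truncate the value.

```sql
-- Make sure the client connection itself talks utf8mb4.
SET NAMES utf8mb4;

-- Hypothetical demo tables: same column, different character sets.
CREATE TABLE demo_utf8 (
  demo_column VARCHAR(20)
) CHARACTER SET utf8;

CREATE TABLE demo_utf8mb4 (
  demo_column VARCHAR(20)
) CHARACTER SET utf8mb4;

-- U+1F600 needs four bytes in UTF-8, which the three-byte "utf8" cannot hold.
-- Assumption: strict SQL mode, so this is expected to fail with
-- "Incorrect string value".
INSERT INTO demo_utf8 (demo_column) VALUES ('😀');

-- The same value fits in a utf8mb4 column.
INSERT INTO demo_utf8mb4 (demo_column) VALUES ('😀');

-- LENGTH() counts bytes, CHAR_LENGTH() counts characters.
SELECT demo_column,
       LENGTH(demo_column)      AS byte_count,
       CHAR_LENGTH(demo_column) AS char_count
FROM demo_utf8mb4;
```

Comparing byte_count with char_count in the last query is a quick way to see how many bytes a stored character actually takes.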

Collation

Collation, that is, the set of rules for sorting and comparing text, can be set at three levels:

1. Database level

2. Table level

3. Column level
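
A minimal sketch of the three levels, assuming a hypothetical database collation_demo and table people. Each level overrides the one above it, so the collations chosen here differ on purpose.

```sql
-- 1. Database level: default for tables created in this database.
CREATE DATABASE collation_demo
  CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
USE collation_demo;

-- 2. Table level: overrides the database default for this table.
-- 3. Column level: overrides the table default for one column.
CREATE TABLE people (
  name VARCHAR(100),
  code VARCHAR(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- The Collation column of the output shows what each column ended up with.
SHOW FULL COLUMNS FROM people;
```

In this sketch, name inherits the table collation while code keeps its own column-level one.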

utf8_general_ci collation

If the character encoding is utf8, the default collation is utf8_general_ci.

utf8mb4_general_ci collation

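If the character encoding is utf8mb4, the default collation is utf8mb4_general_ci in MySQL 5.7 and earlier. MySQL 8.0 changed the default for utf8mb4 to utf8mb4_0900_ai_ci, so utf8mb4_general_ci has to be specified explicitly there.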
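
Because the default depends on the server version, it may be worth checking it rather than assuming it. A short sketch using standard SHOW statements and system variables:

```sql
-- List the collation marked as default for utf8mb4 on this server.
-- `Default` is a reserved word, so the column name needs backticks.
SHOW COLLATION WHERE Charset = 'utf8mb4' AND `Default` = 'Yes';

-- Server-wide defaults used when nothing is specified explicitly.
SELECT @@character_set_server, @@collation_server;
```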

The explanation is as follows. The ci suffix stands for "case insensitive", and it applies to sorting and comparison operations.

utf8mb4_general_ci is geared towards a more "general" use of MySQL and utf8. This collation is widely regarded to take "shortcuts" when comparing and sorting data, which improves speed but may result in sorting errors in some cases.

One of the errors encountered when using utf8mb4_general_ci is a unique constraint violation. The explanation is as follows:

That is, "Fred" and "freD" are considered equal at the database level. If you have a unique constraint on a field, it would be illegal to try to insert both "aa" and "AA" into the same column, since they compare as equal (and, hence, non-unique) with the default collation. If you want case-sensitive comparisons on a particular column or table, change the column or table to use the utf8_bin collation (utf8mb4_bin for utf8mb4 columns).

We do it like this:

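CREATE DATABASE demo_db CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
-- Switch to the database that was just created
USE demo_db;

-- Identifiers are quoted with backticks, not single quotes
CREATE TABLE demo_tbl (
  `archtype_field` VARCHAR(100) DEFAULT NULL
) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;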
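
The unique constraint behaviour quoted above can be reproduced with a sketch like the following. The table names users_ci and users_bin and the key name uq_username are hypothetical.

```sql
-- Case-insensitive collation: 'aa' and 'AA' compare as equal.
CREATE TABLE users_ci (
  username VARCHAR(50),
  UNIQUE KEY uq_username (username)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

INSERT INTO users_ci (username) VALUES ('aa');
-- Expected to be rejected with a duplicate-entry error,
-- because 'AA' equals 'aa' under a _ci collation.
INSERT INTO users_ci (username) VALUES ('AA');

-- Binary collation on the column: comparison is by code point,
-- so 'aa' and 'AA' are distinct values.
CREATE TABLE users_bin (
  username VARCHAR(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin,
  UNIQUE KEY uq_username (username)
);

INSERT INTO users_bin (username) VALUES ('aa'), ('AA');
```

Setting the binary collation only on the column keeps the rest of the table case-insensitive, which is often what is wanted for names and free text.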

utf8mb4_unicode_ci collation

The explanation is as follows. The ci suffix stands for "case insensitive", and it applies to sorting and comparison operations.

utf8mb4_unicode_ci is geared towards "advanced" users - that is, it's a set of collations that is based on Unicode and we can rest assured that our data will be dealt with properly if this collation is in use.
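
One way to see the two collations disagree is to compare the same pair of strings under each. This is a sketch based on the behaviour the MySQL manual documents for the German sharp s: the general collation compares character by character, while the Unicode collation supports expansions such as 'ß' matching 'ss'. The column aliases are made up for the example.

```sql
-- String literals must be utf8mb4 for the COLLATE clauses below to apply.
SET NAMES utf8mb4;

-- COLLATE binds tighter than "=", so each comparison uses the named collation.
SELECT
  'ß' = 'ss' COLLATE utf8mb4_general_ci AS equal_under_general_ci,
  'ß' = 'ss' COLLATE utf8mb4_unicode_ci AS equal_under_unicode_ci;
```

If the two columns differ on your server, that is the "shortcut" of the general collation showing up in practice.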

Monday, 15 August 2022
