Character sets in MySQL – still not for Windows users.

by Aug 25, 2012

Recent MySQL versions (first the chaotic series of releases that preceeded 5.5 – 5.2, 6.0 and 5.4 – and now 5.6) adds new 'character sets' to MySQL. But little of it is useful.

Let us take it from the beginning: before 4.1 MySQL supported a wide range of single-byte character sets: regional ones ('latin1'/Western, 'latin2'/Central-european, arabic etc.) as well as strictly national ones (hebrew, 'armsci'/Armenian, 'tis60'/Thai etc.) and also a few multibyte *non-unicode* character sets for Chinese, Korean and Japanese. 4.1 added the support for Unicode to MySQL with the UTF8 and UCS2 charsets. Since then UTF16, UTF8MB4 (they are useful to a limited amount of users) and also UTF32 was added and in early 5.6 UTF16LE (and they are not useful at all)

What does it mean? Let's start with a Wikipedia quote.
"Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8, UTF-16 and the now-obsolete UCS-2. UTF-8 uses one byte for any ASCII characters, which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters. UCS-2 uses a 16-bit code unit (two 8-bit bytes) for each character but cannot encode every character in the current Unicode standard. UTF-16 extends UCS-2 .. to handle each of the additional characters." This is actually a little incomplete. For both UCS2 and UTF16 the byte order may differ. The most significant byte may come first or last. This is most often called "big endian byte order" and "little endian byte order" respectively.

The significant Unicode encodings used today (we ignore here implementations specific for IBM mainframes etc.) are
1) UTF8 (native Unicode format on Linux and most often best unicode option for storing data with MySQL on all platforms)
2) UTF16 with big-endian byte-order (native unicode format on traditional Unix systems – including Mac OS-X and Solaris).
3) UTF16 with little-endian byte-order (native unicode format on Windows 2K+ – often referred to as 'Windows Unicode').
And actually 'endian-ness' of the encoding also applies to UCS2 like this: little-endian UCS2 was the only unicode encoding that Windows 9x/ME could handle and some old Unix'es may not handle the UTF16 extensions to UCS2.

I think every Windows user has noticed the 'save as' dialog in Windows programs like Notepad. Here there is a control (a 'dropdown') where user can select the encoding the file shall be saved with. Windows will propose "ANSI" as long as file can be saved without loss with an single copepage -encoding matching the current LOCALE setting. If not it will propose "Unicode" (Windows' term for 'UTF16 with little-endian byte-order'), but user may select UTF8 or big-endian UTF16 as well. However not all Windows program has this option. Some will only save to ANSI and  'Windows Unicode' and read files encoded with same.

So Windows users need an option to read and write files encoded with 'Windows Unicode'. You would expect that LOAD DATA and SELECT INTO OUTFILE had the option in MySQL. They don't.


b) Second: There is no MySQL character set identical to 'Windows Unicode' .. well STOP. It is there. In 5.6 it is there. Refer "utf16le, the UTF-16LE encoding for the Unicode character set; like utf16 but little-endian rather than big-endian." Hooray – so we can now specify that character set with LOAD DATA!  🙂

.. but 🙁 ..

1) (same page): "The only collations available are utf16le_general_ci and utf16le_bin. These are similar to utf16_general_ci and utf16_bin."

2) "ucs2, utf16, utf16le, and utf32 cannot be used as a client character set, which means that they also do not work for SET NAMES or SET CHARACTER SET."

3) "Note .. It is not possible to load data files that use the ucs2, utf16, utf16le, or utf32 character set."

It is the 3rd point that matters most. Nobody needs neither UTF32 nor UTF16LE on the server. It is irrelevant. UTF8 and UCS2 (or rather UTF8MB4 and UTF16 from MySQL 5.5 – but that is another story) do all what is required for users that want to use unicode. UTF16 is the best option for most users I think, but UCS2/UTF16 has the advantage for Asian users that national characters will only consume 2 bytes on disk and in memory – with UTF8 they will consume 3 bytes.

When we need other formats/encodings it is in order to interact with the Operating System/File System (and *BOTH* the server *AND* client Operating System/File System) as well as applications requiring files of the native unicode encoding on client or server platform. There is no other need for them.

So of the character sets added since 4.1 the only UTF8MB4 and UTF16 charsets (that both add support for Unicode characters outside the 'basic multilingual plane'  to MySQL) are a little useful. But only useful for a small fraction of users as most will never use such characters at all.

The effort with UTF16LE till now is no less than a joke IMO. We don't need to store data with that encoding in tables or CAST/CONVERT to/from it. What we need is to read and write files with that encoding from both server system and client system ("LOAD DATA [LOCAL] INFILE .." and "SELECT .. INTO [LOCAL] OUTFILE ..").