utf8_general_ci is a lot faster than utf8_unicode_ci and it's good enough for wh...

Larrikin · on Feb 28, 2014

Good enough when you ignore non-English speakers is not good enough.

coldtea · on Feb 28, 2014

For a lot of use cases it very much is. Even ASCII can work for some. Not all tables/DBs have to do with spoken language data.

dmethvin · on Feb 28, 2014

But a lot of databases have to deal with people's names. To paraphrase an old saying, "I don't care why I'm in the database, as long as they spell my name right."

joshfraser · on Feb 28, 2014

utf8_general_ci will still spell your name correctly; it just might not sort it correctly if you use an ORDER BY.

_delirium · on Feb 28, 2014

For highly multilingual databases, sort order is basically a lost cause anyway, because there is no globally correct, multilingual Unicode sort order. Even when using the same alphabet, different languages have different sorting conventions. And some languages' correct sorting order is not even fully decidable solely from the UTF-8 text. For example, Danish sometimes treats 'aa' as a sequence of two 'a' letters (mainly in words and names of non-Danish origin), and other times as a variant spelling of 'å'. So in a properly collated Danish encyclopedia, Aachen goes near the beginning of the encyclopedia, while Aalborg goes near the end. Good luck implementing that in your database!
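
To make the Aachen/Aalborg point concrete, here is a toy Python sketch. It is not the Unicode Collation Algorithm or any database's collation; the alphabet string, the `FOREIGN_AA` exception set, and the `danish_key` helper are all made up for illustration. It only shows that a plain code point sort and a Danish-style sort disagree, and that the Danish-style one needs word-level knowledge the bytes alone don't carry.

```python
# Toy illustration only: NOT the Unicode Collation Algorithm or ICU.
# Danish alphabet order, with ae-ligature, o-slash and a-ring at the end.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e6\u00f8\u00e5"
RANK = {ch: i for i, ch in enumerate(DANISH_ALPHABET)}

# Hypothetical exception list: words where "aa" really is two separate a's.
# A real system would need a dictionary like this; the text cannot tell you.
FOREIGN_AA = {"aachen"}


def danish_key(word):
    """Build a sort key that treats 'aa' as a-ring unless the word is a known exception."""
    w = word.lower()
    if w not in FOREIGN_AA:
        w = w.replace("aa", "\u00e5")
    # Characters outside the toy alphabet sort after everything else.
    return [RANK.get(ch, len(DANISH_ALPHABET)) for ch in w]


names = ["Aalborg", "Zebra", "Aachen", "Odense"]

by_code_point = sorted(names)                  # what a naive binary sort does
by_danish_key = sorted(names, key=danish_key)  # toy Danish-style ordering

print(by_code_point)
print(by_danish_key)
```

Remove "aachen" from the exception set and Aachen jumps to the end next to Aalborg, which is exactly the part that can't be decided from the text alone.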

rogerbinns · on March 1, 2014

The Unicode people have addressed this with the Unicode Collation Algorithm - http://www.unicode.org/reports/tr10/ - which obviously can't be perfect, but it can be reasonable.

The ICU project - http://site.icu-project.org/ - has open source implementations of the collation algorithm, including appropriate information for different locales. I.e., you can show those Danish names to a Danish user in their expected sort order, while also showing them to an American in their expected sort order.

i18n and l10n are hard. But they are also largely solved fairly well, and there is no excuse to avoid them altogether or to not use the ICU code.

bjxrn · on Feb 28, 2014

In that case, why not use latin1? Or hey, why not just ASCII?

joshfraser · on Feb 28, 2014

We're discussing how well each collation SORTS international strings, not whether or not you can store them. There are a lot of instances where you need to be able to store an international string without needing to be able to do a highly accurate ORDER BY on that column.

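It's also worth mentioning that, by design, UTF-8 doesn't use any more space than ASCII for storing ASCII-range text: each of those characters still takes exactly one byte, and only characters outside that range take more. There are exceptions when databases need to pre-allocate storage, but in general, you should just be using UTF-8 everywhere.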
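
A quick way to check the storage claim yourself, sketched in Python. The sample strings are arbitrary, and this measures encoded byte length only; it says nothing about how a particular database engine allocates columns.

```python
# ASCII-only text encodes to the same bytes under ASCII and UTF-8.
ascii_bytes = "Aalborg".encode("ascii")
utf8_bytes = "Aalborg".encode("utf-8")
same_encoding = ascii_bytes == utf8_bytes

# Characters outside the ASCII range cost more than one byte each in UTF-8.
samples = ["Aalborg", "\u00c5lborg", "\u65e5\u672c\u8a9e"]
utf8_sizes = [(len(s), len(s.encode("utf-8"))) for s in samples]

print(same_encoding)
for text, (chars, size) in zip(samples, utf8_sizes):
    print(f"{text}: {chars} characters, {size} bytes in UTF-8")
```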