
utf8_general_ci is a lot faster than utf8_unicode_ci and it's good enough for whenever you don't need to sort international strings.


"Good enough" when you ignore non-English speakers is not good enough.


For a lot of use cases it very much is. Even ASCII can work for some. Not all tables/DBs have to do with spoken-language data.


But a lot of databases have to deal with people's names. To paraphrase an old saying, "I don't care why I'm in the database, as long as they spell my name right."


utf8_general_ci will still spell your name correctly, it just might not sort it correctly if you use an ORDER BY
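To make the store-versus-sort distinction concrete, here is a rough sketch using SQLite from Python rather than MySQL, so it is only an analogy for utf8_general_ci vs utf8_unicode_ci. The collation name simple_ci and the accent-stripping comparison are made up for illustration; they are not what MySQL does internally.

```python
import sqlite3
import unicodedata


def fold(text):
    # Drop combining marks and case so that "É" compares like "e".
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def simple_ci(a, b):
    # Hypothetical collation: returns negative, zero, or positive like strcmp.
    fa, fb = fold(a), fold(b)
    return (fa > fb) - (fa < fb)


conn = sqlite3.connect(":memory:")
conn.create_collation("simple_ci", simple_ci)
conn.execute("CREATE TABLE people (name TEXT)")

# Normalize to NFC so the byte order below does not depend on how this file was saved.
names = [unicodedata.normalize("NFC", n) for n in ["Zoë", "Émile", "Eve", "Ångström"]]
conn.executemany("INSERT INTO people VALUES (?)", [(n,) for n in names])

# Default (byte-wise) ordering versus the custom collation.
binary_order = [row[0] for row in conn.execute("SELECT name FROM people ORDER BY name")]
simple_order = [
    row[0]
    for row in conn.execute("SELECT name FROM people ORDER BY name COLLATE simple_ci")
]

print(binary_order)
print(simple_order)
```

Both queries return exactly the same strings, spelled as they were inserted. Only the order of the rows changes with the collation.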


For highly multilingual databases sort order is basically a lost cause anyway, because there is no globally correct, multilingual Unicode sort order. Even when using the same alphabet different languages have different sorting conventions. And some languages' correct sorting order is not even fully decidable solely from the UTF8 text. For example, Danish treats 'aa' as a sequence of two 'a' letters sometimes (mainly in words and names of non-Danish origin), and as a variant spelling of 'å' other times. So in a properly collated Danish encyclopedia, Aachen goes near the beginning of the encyclopedia, while Aalborg goes near the end. Good luck implementing that in your database!
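A toy sketch of the Aachen/Aalborg problem in Python, in case it helps. The alphabet table and the exception set are assumptions made up for this example. The point is that the text alone cannot tell you which words belong in that set, so someone has to maintain it by hand.

```python
# Danish alphabet order: a-z, then ae-ligature, o-slash, a-ring (written as escapes).
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e6\u00f8\u00e5"
RANK = {ch: i for i, ch in enumerate(DANISH_ALPHABET)}

# Hypothetical, hand-maintained list of words where "aa" really is two a's.
AA_IS_TWO_LETTERS = {"aachen"}


def danish_key(word):
    lowered = word.lower()
    if lowered not in AA_IS_TWO_LETTERS:
        # Treat "aa" as a variant spelling of a-ring, which sorts last.
        lowered = lowered.replace("aa", "\u00e5")
    # Unknown characters sort after the whole alphabet.
    return [RANK.get(ch, len(DANISH_ALPHABET)) for ch in lowered]


words = ["Aalborg", "Zebra", "Aachen", "Odense"]

by_code_point = sorted(words)
by_danish_rule = sorted(words, key=danish_key)

print(by_code_point)   # Aachen and Aalborg end up next to each other
print(by_danish_rule)  # Aachen first, Aalborg last
```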


The Unicode people have addressed this with the Unicode Collation Algorithm - http://www.unicode.org/reports/tr10/ - which obviously can't be perfect, but it can be reasonable.

The ICU project - http://site.icu-project.org/ - has open source implementations of the collation algorithm, including appropriate information for different locales. I.e., you show those Danish names to a Danish user in their expected sort order, while also showing them to an American in their expected sort order.

i18n and l10n are hard. But they are also largely solved fairly well, and there is no excuse to avoid them altogether or not use the ICU code.


In that case, why not use latin1? Or hey, why not just ASCII?


We're discussing how well each collation SORTS international strings, not whether or not you can store them. There are a lot of instances where you need to be able to store an international string without needing to be able to do a highly accurate ORDER BY on that column.

It's also worth mentioning that by design, UTF-8 doesn't use any more space than ASCII to store text that is in the ASCII range; only characters outside that range take two to four bytes each. There are exceptions when databases need to pre-allocate storage, but in general, you should just be using UTF-8 everywhere.
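A quick way to check the storage point from Python. The sample strings are arbitrary picks for illustration.

```python
ascii_text = "Aalborg"
danish_text = "\u00c5lborg"  # the same city spelled with a-ring

# For pure ASCII text, the UTF-8 bytes are identical to the ASCII bytes.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Byte counts per encoding; a-ring costs 2 bytes in UTF-8 and 1 in latin1.
ascii_utf8_len = len(ascii_text.encode("utf-8"))
danish_utf8_len = len(danish_text.encode("utf-8"))
danish_latin1_len = len(danish_text.encode("latin-1"))

print(same_bytes, ascii_utf8_len, danish_utf8_len, danish_latin1_len)
```

So the only extra cost is paid on the characters ASCII couldn't represent in the first place.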



