> In the UNIX world, narrow strings are considered UTF-8 by default almost everywhere. Because of that, the author of the file copy utility would not need to care about Unicode
It couldn’t be further from the truth. Unix paths don’t need to be valid UTF-8 and most programs happily pipe the mess through into text that should be valid. (Windows filenames don’t have to be proper UTF-16 either)
Rust is one of the few programming languages that correctly doesn’t treat file paths as strings.
> It couldn’t be further from the truth. Unix paths don’t need to be valid UTF-8 and most programs happily pipe the mess through into text that should be valid. (Windows filenames don’t have to be proper UTF-16 either)
A decent fraction of software can impose rules on the portion of the filesystem within its control. A tool like mv or vim has to be prepared to handle any filepath encoding. But something like a VCS could reasonably insist on supporting only file trees with normalized UTF-8 names and no case-insensitivity conflicts, since that's about all that works reliably cross-platform.
The history of Git and Subversion handling filenames makes me think that the opposite is true: A VCS which doesn't handle arbitrary byte-strings will have weird edge cases which prevent users from adding files or accessing them, possibly even “losing” data in a local checkout. This is especially tedious because it'll appear to work for a while until someone first tries to commit an unusual file or checks it out with a previously-unused client.
My understanding is, you can't treat the filename as an arbitrary bytestring, since you have to transcode it across platforms, otherwise the filename won't show up properly everywhere. E.g. if I make a file named "test" on Unix, it will be UTF-8 (assuming a sane Unix). If on Windows I create a file with the filename "test", encoded as UTF-8, it will show up as worthless garbage in explorer.exe, since explorer will decode it as UTF-16.
So a VCS needs to know the filename encoding in order to work properly.
The actual text isn't an arbitrary byte string. There is logical data and then there is its representation. char, short, int, string can all logically refer to the number 0, but the representation is completely different. With char it is even possible to represent the same number in two ways: as a binary 0 or as the character code for 0. Allowing byte strings as the physical representation is not a bad idea to stay future-proof, but you then have to provide additional information by storing the character encoding that was used to create the arbitrary byte string. If you fail to do that, this information has to be provided through convention, and that's how we get "stuck" with UTF-8, and although I like UTF-8, this doesn't feel like the right solution. If everyone agrees to use UTF-8 then we should stop pretending that something is just an arbitrary byte string and formalize UTF-8.
The idea of an arbitrary byte string is fooling people into believing something that is not true. Developers falsely think their software can handle any character encoding. However, once you decide to support only a single character encoding you will notice that if something better comes along you need a way to differentiate the old and new codec. Then you decide to add a field that declares the character encoding type and suddenly it's obvious that your arbitrary byte string is a bad way of dealing with the problem. That byte string has meaning. Don't throw that meaning away.
Sure, as long as you don't have to be compatible with anything else, you can assume whatever encoding you want. That doesn't change the point that general programs can't make that assumption.
Yet, your shell will treat them like UTF-8 just as well. As will the standard library of almost every programming language, as you noticed.
If you open one such file in most text editors, they will render whatever is in it as UTF-8. If you use text manipulating utilities, they will work with it as if it was encoded in UTF-8.
It's mostly the Linux kernel that disagrees. Everything else considers them UTF-8.
At least for source-based Linux distributions (Gentoo, Exherbo) I remember that you have to define the locales you want to use and which ones should be the default. And when I build a system without UTF-8 locales, I doubt that the shell will treat paths as UTF-8.
The shell, like most programs, doesn't need to bother with the encoding of filenames. Mostly. I can use LANG=C and TAB still autocompletes filenames, even Cyrillic ones, because bash doesn't care about encodings: the terminal uses UTF-8, so it can output UTF-8 without any help from bash. It's nevertheless sometimes a pain to work with, because readline fails to count visible characters (it counts bytes instead). You type characters into the command line, fill it to the end, then the cursor jumps to the left side of the terminal and continues, placing characters over other characters. It's as if \r were used instead of \n.
`LANG=C ls` tries to be smarter and uses escape syntax for everything except printable ASCII characters. But other utilities from coreutils work even with a locale that doesn't match the file name encoding: cp, mv, grep, ...
The point is: it doesn't matter what encoding strings use until you try to render the string on a screen.
Which is a silly position since the kernel is the only thing that matters. You're right that not too many people will complain if your program crashes on non-UTF-8 paths. Same with spaces in group names. 100% valid and accepted. Breaks a ridiculous amount of software if you actually do it.
But that doesn't mean it's right. It just means that we have a calcified convention.
> narrow strings are considered UTF-8 by default almost everywhere
It means that this is mostly true.
I dunno what it should be. There are benefits and costs to both allowing and restricting the names, and there are good reasons for the kernel alone to support them even though all of userland doesn't. But it does mean that, in practice, you just use UTF-8 and you're done.
Exactly. And they still refuse to acknowledge that treating public names, like a file path, as binary only is a well-known security issue. Names are identifiers and must be recognizable.
With UTF-8 it is trivial to create similar-looking names and fool the user into thinking it is a valid name. You know this concept from domain names, which use Punycode as an escape mechanism. But both the kernel and the various libcs are too lazy to treat confusables with escapes, to normalize Unicode, or to use the proper Unicode security mechanisms for identifiers, like handling mixed scripts, right-to-left text and such.
E.g. searching a file path needs to follow Unicode rules, as we are dealing with identifiers. I believe my libc, the safeclib, is the only one even offering such functionality.
Likewise the presentation layer on the UI (shell, windows) doesn't present confusables as such, but happily takes i18n seriously. Convenience first, security last.
Apple's previous HFS+ normalized names, the new one is insecure again.
> Rust is one of the few programming languages that correctly doesn’t treat file paths as strings.
Imagine if languages allowed subtypes of strings which are not directly assignment compatible.
HtmlString
SqlString
String
A String could be converted to HtmlString not by assignment, but through a function call, which escapes characters that the browser would recognize as markup.
Similarly a String would be converted to a SqlString via a function.
It would be difficult to accidentally mix up strings because they would be assignment incompatible without the functions that translate them.
There could be mixed "languages" within a string. Like a JSP or PHP that might contain scripting snippets, and also JavaScript and CSS snippets, each with different syntax rules and escaping conventions.
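A minimal sketch of the idea in Python (HtmlString and escape_html are made-up names for illustration; a static checker like mypy is what enforces the separation, since NewType is erased at runtime):

    import html
    from typing import NewType

    # A plain str is not an HtmlString until it has been escaped; a type
    # checker flags any attempt to pass one where the other is expected.
    HtmlString = NewType("HtmlString", str)

    def escape_html(s: str) -> HtmlString:
        # The only sanctioned conversion from str to HtmlString.
        return HtmlString(html.escape(s))

    def render(fragment: HtmlString) -> str:
        return "<p>" + fragment + "</p>"

    print(render(escape_html("<script>alert(1)</script>")))
    # render("<script>alert(1)</script>")   # rejected by mypy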
It's absolutely useful enough, it's just that it's awful in C++ due to language limitations as opposed to other languages such as Haskell, where it is standard.
How would it be awful in C++? It seems trivial to do: basic_string is already templated, and distinct instantiations are not mutually compatible by default. In fact wstring, u8string, u16string, u32string exist today in the language simply as distinct instantiations of basic_string. You can create your own by picking a new char type. Algorithms can be, and are, generic and work on any string type.
Not quite at that level, but Rust does have OsString (managed the same way as the OS, often but not always UTF-8) and CString (basically just a byte buffer, just like C likes), with special rules around the inclusion of nulls and null terminators. That gives you the benefit of the behaviour you mentioned: not allowing an invalid string type for a function call.
The sqlx crate for Rust also has a macro called query!, which (at compile time) validates the SQL and creates a value of type "record". Similar idea there, since you'll get early errors from the compiler if you write SQL with mistakes in it.
Go is like that. Not the "mixed within" part, though html/template's AST understands the context where you're using a value and escapes it differently. For example, https://golang.org/pkg/html/template/#HTML
Yes. But having the compiler enforce it is your first line of defense. If it doesn't compile, you know there is an actual problem. In modern IDEs, you see these compile errors as quickly as you type them.
This pattern (newtyping) is a huge weakness of Java in general, and even more so older Java, and people who like newtyping are not going to like Java.
Because creating newtypes in Java is
1. verbose, defining a trivial wrapper takes half a dozen lines before you've even done anything
2. slow, because you're paying for the overhead of an extra allocation and pointer indirection every time, unless you jump through unreadable hoops making for even more verbose newtypes[0]
It is a much more convenient (and thus frequent) pattern in languages like Haskell. Or Rust.
I used Pascal through the '80s and part of the '90s. Currently I use Java. I almost tried Delphi, but my shop moved on to something else between Pascal and Java.
Now the string types have an encoding, and the strings themselves do, too. When you assign a string to a string variable with a type of a different encoding, the string is automatically converted.
But it is causing a huge mess, especially with existing code. When you have one library using UTF-8 and another library using the default codepage, that no longer works. And you can manually override the encoding for each string, so any string might have any encoding regardless of its type.
I have a benchmark of various maps in freepascal. The benchmark creates strings of random bytes to use as keys.
A classic key-value store is the sorted TStringList.
Now the benchmark of the TStringList fails, apparently because it now assumes the keys are valid UTF-8 when the UTF-8 codepage is the default codepage.
The default codepage can be changed. When I start the benchmark with LANG=C .. it works with the random byte keys. On Windows, the default codepage is usually latin1, so it would work there, too.
> It couldn’t be further from the truth. Unix paths don’t need to be valid UTF-8
Yes but, most programs expect to be able to print filepaths at least under some circumstances, like printing error messages. Even if a program is fully correct and doesn't assume an encoding in normal operation, it still has to assume one for printing. Filepaths that aren't utf-8 lead to a bunch of ����� in your output (at best). So I think it's fair to say that Unix paths are assumed to be utf-8 by almost all programs, even if being invalid utf-8 doesn't actually cause a correct program to crash.
In the Rust std one can easily use the lossless presentation with file APIs, and print a lossy version in error messages. I find this to be good enough.
I dunno. That sounds like proposing to render "foo.txt" as "Zm9vLnR4dA==" or "[102, 111, 111, 46, 116, 120, 116]" or something. I think you probably meant something like "print the regular characters if the string is UTF-8, or a lossless fallback representation of the bytes otherwise." That's a good idea, and I think a lot of programs do that, but at the same time "if the string is UTF-8" is problematic. There's no reliable way for us to know what strings are or are not intended to be decoded as UTF-8, because non-UTF-8 encodings can coincidentally produce valid UTF-8 bytes. For example, the two characters "&!" are the same bytes in UTF-8 as the character "Ω" is in UTF-16. This works in Python:
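Something like this, for example (taking the "Ω" above to be U+2126 OHM SIGN, whose single UTF-16-LE code unit is literally the bytes 0x26 0x21):

    # The same two bytes are simultaneously valid UTF-8 text ("&!") and a
    # single valid UTF-16-LE character, so the bytes alone can't tell you
    # which encoding was intended.
    data = "&!".encode("utf-8")
    print(data)                      # b'&!'
    print(data.decode("utf-16-le"))  # Ω  (U+2126 OHM SIGN)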
So I think I want to claim something a bit stronger:
1) Users demand, quite rightly, to be able to read paths as text.
2) There is no reliable way to determine the encoding of a string, just by looking at its bytes. And Unix doesn't provide any other metadata.
3) Therefore, useful Unix programs must assume that any path that could be UTF-8, is UTF-8, for the purpose of displaying it to the user.
Maybe in an alternate reality, the system locale could've been the reliable source of truth for string encodings? But of course if we were starting from scratch today, we'd just mandate UTF-8 and be done with it :)
> 2) There is no reliable way to determine the encoding of a string, just by looking at its bytes. And Unix doesn't provide any other metadata. 3) Therefore, useful Unix programs must assume that any path that could be UTF-8, is UTF-8, for the purpose of displaying it to the user.
No, there are locale settings (in envvars), and software should assume the path encoding based on the locale's encoding.
It is true that today the locale is usually UTF-8 based, but if I use a non-UTF-8 locale then tools should not assume paths are in UTF-8 and recode them.
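For what it's worth, that's roughly how Python behaves: it derives the filesystem encoding from the locale (unless UTF-8 mode is forced) and round-trips undecodable bytes with the surrogateescape error handler instead of dropping them. A rough sketch:

    import os, sys

    print(sys.getfilesystemencoding())   # picked from the locale/environment

    raw = b"caf\xe9.txt"                 # Latin-1-style bytes, not valid UTF-8
    name = os.fsdecode(raw)              # under a UTF-8 locale: 'caf\udce9.txt'
    assert os.fsencode(name) == raw      # lossless round trip back to the bytes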
No, the proposal is not for crazy encoding schemes, like for domain names, that's up to the presentation layer.
The need is to follow the Unicode security guidelines for identifiers. A path is an identifier, not a binary chunk, thus it needs to follow some rules. Lately some filesystem drivers have agreed, but it's still totally insecure all over.
OSX will most likely barf at or mangle invalid file names (HFS+ requires well-formed UTF-16, which translates to well-formed UTF-8 at the POSIX layer), and there are ZFS systems which are configured with utf8only set.
It would be more precise to say that you can't assume UNIX paths are anything other than garbage.
Yes, but the only way to interop multiple scripts on a POSIX filesystem is to use UTF-8. I can forgive people for not realizing that filenames in POSIX are a weird animal: they are NUL-terminated strings of characters (char) in some arbitrary codeset and encoding, but US-ASCII '/' is special.
EDIT: Also, "considered UTF-8 by default almost everywhere" is... not necessarily wrong -- nowadays users should be using UTF-8 locales by default. Maybe "almost everywhere" is an exaggeration, but I wouldn't really know.
> Unix paths don’t need to be valid UTF-8 and most programs happily pipe the mess through into text that should be valid
How about a new mount option utf8_only? When that is set on a volume, the VFS would block any attempt to create a new file/directory if the name isn't valid UTF-8. (Pre-existing file/directories with invalid UTF-8 can still be accessed.) Distributions could set it by default on all filesystems, but a user could turn it off if it caused a problem for them (which in practice is probably going to be rare.)
One could also have a flag set on the filesystem (e.g. in the superblock) similar to utf8_only. It could only be set at filesystem creation time. If it is set, then any invalid UTF-8 in a filename is a filesystem corruption which fsck could repair. A filesystem with such a flag set would ban invalid UTF-8 irrespective of any utf8_only mount option.
If we are going to ban invalid UTF-8, it would be a good idea for security reasons to ban C0 controls as well (i.e. all characters in range U+0001 to U+001F), see [1]. This could be included in the utf8_only mount option / filesystem flag, or be an independent mount option / filesystem flag. If going with the same flag for both, maybe "sane_filenames_only" might be a better name.
(Actually, for security, one should ban the UTF-8 encodings of the C1 controls as well... the CSI character U+009B might be interpreted as an ESC[ by some applications, which could have nefarious consequences. Likewise, the APC (application program command) and OSC (operating system command) characters could cause security issues, although in practice support for them is rather limited, which limits the scope of the security issues they pose.)
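A userspace sketch of roughly that check (the real thing would live in the kernel's VFS; the function name and sample names below are just illustrative):

    def is_sane_filename(name: bytes) -> bool:
        # Reject names that aren't valid UTF-8 or that contain C0/C1
        # control characters (U+0001..U+001F, U+0080..U+009F).
        try:
            text = name.decode("utf-8")
        except UnicodeDecodeError:
            return False
        return not any(0x01 <= ord(c) <= 0x1F or 0x80 <= ord(c) <= 0x9F for c in text)

    assert is_sane_filename("report.txt".encode())
    assert not is_sane_filename(b"caf\xe9.txt")              # not valid UTF-8
    assert not is_sane_filename("evil\x1b[2Jname".encode())  # embedded ESC (C0)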
And a lucky thing too; OSes that do have UTF-8 filesystems don’t always agree on how to apply canonicalization, much less how to deal with canonicalization differences between user entered data and normalized filesystem names.
They are pretty straightforward: they are just path structures rather than path names that may turn into single strings when supplied to your kernel. Or, depending on the OS maybe only part of the name is turned into a string and part determines which device or syntax applies. All of which is abstracted away by the path objects.
Back in the 1970s when this first appeared on Lisp machines, it was not uncommon to use remote file systems transparently, and those remote file systems could be on quite different OSes like ITS, TOPS-10 or -20, VMS, one of the Lisp machine file systems, and even Unix (though networking came quite late to Unix). “MC:GUMBY; FOO >” and “OZ:<GUMBY>FOO.TXT;0” were perfectly reasonable filenames. Some of those systems had file versioning built into them. So if the world looks like Unix to you, some of that additional expressive power could be confusing.
C++17 path support is a neutered version of Common Lisp’s.
(Seriously though, is it pathnames you don't understand or logical hosts? Because CL pathnames are actually pretty straightforward. Logical hosts, on the other hand, are a hot mess.)
? (type-of "this is a string")
(SIMPLE-BASE-STRING 16)
? (type-of #P"/this/is/a/pathname")
PATHNAME
You can't perform string operations on a pathname.
? (subseq "This is a string" 5 15)
"is a strin"
? (subseq #P"/This/is/a/pathname" 5 15)
> Error: The value #P"/This/is/a/pathname" is not of the expected type SEQUENCE.
You can perform pathname operations on a string, but only because the string is automatically converted into a pathname first.
Maybe Linux (and other OSes) should deprecate non-UTF-8 filenames and start disallowing the creation of filenames that aren't valid UTF-8?
It seems silly that directory entries are just binary blobs and yet 99.99% of all software I know of passes around paths as strings. We could ask all software to stop that (boil the ocean) or we could just ask the OSes to stop it (there are far fewer OSes than there is other software).
Read that quote again: 'considered UTF-8 by default almost everywhere'. It is absolutely the truth. While you can stuff non-UTF-8 in, almost all of your tools will handle it badly. Even Rust programs wanting to log the file name. It is the same as considering email addresses case sensitive: technically correct, practically shooting yourself in the foot.
> Rust is one of the few programming languages that correctly doesn’t treat file paths as strings.
Rust is one of a few programming languages that incorrectly treat "string" as if it were a coherent concept distinct from byte buffers.
Among those, it has the distinction of not forcing file paths into this inherently incorrect model.
(In practice, if you have a type system that can distinguish arbitrary byte buffers from ones with a known encoding, that is far from the most useful thing to distinguish about them anyway.)
git will also do this, so on an fs that allows arbitrarily-byte-named files, you can end up with tree objects of the same name, which makes digging them out later "fun".
It's a reflection of the fact people aren't going to throw out existing filesystems because they aren't in a specific character encoding. There's nothing the OS can do about that, there's nothing programmers in general can do about that, and the only way to fix it is with a time machine and enough persuasion to force everyone to implement Unicode and UTF-8 to the exclusion of any other character encoding schemes.
And it would still be wrong, because the rules of what constitutes valid unicode have changed (what's a surrogate?), and also why would that be a good idea to bake into your filesystem??
It would be a very good idea to acknowledge the existence of codecs by storing the identifier of the chosen codec but forcing a specific one doesn't appear to be that useful.
> one of the few programming languages that correctly doesn’t treat file paths as strings
I hear: one of those few programming languages that, despite its vaunted type-safety, makes it possible to accidentally create a file with a completely bogus name that I won't be able to view or open correctly with half the programs on my computer.
Languages which allow arbitrary byte sequences in paths are the cause of, and solution to, all of Unix's pathname problems.
No, it’s impossible to do that accidentally, due to its type safety. You have to be pretty explicit about passing a non-string in (all Rust strings are valid UTF-8).
However, sometimes you're in a layer where ASCII is fine, and you should just be explicit about that.
Server Name Indication (in RFC 3546) is flawed in several ways; it's a classic unused extension point, for example, because it has an entire field for what type of server name you mean, with only a single value for that field ever defined. But one flaw that stands out is that it uses UTF-8 encoding rather than insisting on ASCII for the server name.
You can see the reasoning: international domain names are a big deal, we should embrace Unicode. But IDNA already needed to handle all this work; the DNS A-labels are already ASCII even for IDNs.
Essentially choosing UTF-8 here only made things needlessly more complicated in a critical security component. Users, the people who IDNs were for, don't know what SNI is, and don't care how it's encoded.
Trying to figure out how to express this without making people mad at me. I think the conflation of Unicode with "plain text" might be a mistake. Don't get me wrong, Unicode serves an important purpose. But bumping the version from plain text 1.0 (ASCII) to plain text 2.0 (Unicode) introduced a ton of complexity, and there are cases where the abstractions start leaking (iterating characters etc).
With things like data archival, if I have a hard drive with the Library of Congress stored in ASCII, I need half a sheet of paper to understand how to decode it.
Whereas apparently UTF8 requires 7k words just to explain why it's important. And that's not even looking at the spec.
Just to be crystal clear, I'm not advocating to not use Unicode, or even use it less. I'm just saying I think it maybe shouldn't count as plain text, since it looks a lot like a relatively complicated binary format to me.
Unicode is complicated because the languages it needs to handle are, alas, complicated. UTF-8 is super simple. It's a variable-length encoding for 21-bit unsigned integers. Wikipedia gives a handy table showing how it works: a lead byte whose high bits say how many bytes the sequence has (one to four), followed by continuation bytes of the form 10xxxxxx.
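The gist of that table, seen through Python's encoder (the sample characters are arbitrary, one per sequence length):

    # One sample character per sequence length: 1, 2, 3 and 4 bytes.
    for ch in ["A", "é", "€", "🙂"]:
        print(f"U+{ord(ch):04X} -> {list(ch.encode('utf-8'))}")
    # U+0041 -> [65]
    # U+00E9 -> [195, 169]
    # U+20AC -> [226, 130, 172]
    # U+1F642 -> [240, 159, 153, 130]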
When I wrote a very primitive UTF-8 library, I really began to appreciate UTF-8's design. For example: the first byte says how many bytes the character requires. At first it was daunting, but when I put two and two together, it really opened up.
I am sure there are many aspects I am missing about UTF-8, but it is all reasonable in its design and implementation.
For reference, I was converting between code points and actual bytes, and also implemented strlen and strcmp (the latter of which the standard library apparently handles fine as-is).
The self-synchronizing property is also very clever. If you land on an arbitrary byte, you can find a character boundary by scanning forward at most three bytes.
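A small sketch of that resynchronisation (the helper is just for illustration):

    def next_boundary(data: bytes, i: int) -> int:
        # Continuation bytes all match 10xxxxxx, so skip them until we land
        # on the first byte of a character (or the end of the data).
        while i < len(data) and (data[i] & 0xC0) == 0x80:
            i += 1
        return i

    s = "naïve 🙂".encode("utf-8")
    print(next_boundary(s, 3))  # 4: index 3 is the second byte of "ï",
                                # so we land on the "v" that follows it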
Yeah, this. I have a pat "Unicode Rant" that boils down to this essentially.
Having a catalog of standard numbers-to-glyphs (or symbols or whatever, little pictures humans use to communicate with) is awesome and useful (and all ASCII ever was) but trying to digitalize all of human language is much much more challenging.
But human language doesn't stop being "much much more challenging" if you decide not to engage.
Sometimes (and this can even be an admirable choice) in some specialist applications it's acceptable to decide you won't embrace the complexity of human language. But in a lot of places where that's fine we already did this with the decimal digits such as in telephone numbers, or UPC/EAN product codes, so we don't need ASCII.
In most other places insisting upon ASCII is just an annoying limitation, it's annoying not being able to write your sister's name in the name of the JPEG file, regardless of whether her name is 林鳳嬌 or Jenny Smith, and it jumps out at you if the product you're using is OK with Jenny Smith but not 林鳳嬌.
You might think well, OK, but there weren't problems in ASCII. The complexity is Unicode's fault. Think about Sarah O'Connor? That apostrophe will often break people's software without any help from Unicode.
Your sister's name doesn't render in my browser (stable Firefox on Linux 5.6). I'm sure I'm missing a fontpack or something. Again, I'm not saying ASCII is the solution, I'm saying Unicode is much more difficult to get right, and maybe we should call it something other than "plain text", since we already had a generally accepted meaning for that for many years. I'm usually in favor of making a new name for a thing rather than overloading an old name.
Firefox does full font fallback. So this means your system just isn't capable of rendering her name (which yes you might be able to fix if you wanted to by installing font packages). If you don't understand Han characters that's an acceptable situation, the dotted boxes (which I assume rendered instead) alert you that there is something here you can't display properly but if you know you can't understand it even if it's displayed there's no need to bother.
It really is just plain text. Human writing systems were always this hard, and "for many years" what you had were separate independent understandings of what "plain text" means in different environments, which makes interoperability impossible. Unicode is mostly about having only one "plain text" rather than dozens.
It is not mandatory that your 80x25 terminal learn how to display Linear B, you can't read Linear B and you probably have no desire to learn how and no interest in any text written in it. But Unicode means your computer agrees with everybody else's computer that it's Linear B, and not a bunch of symbols for drawing Space Invaders, or the manufacturer's logo, if you fix a typo in a document I wrote that has some Linear B in it, your computer doesn't replace the Linear B with question marks, or erase the document, since it knows what that is even if you can't read it and it doesn't know how to display it.
But I'm not saying we shouldn't engage, I'm just pointing out that the catalog of lil pictures is the easy part of the task.
One way I put it is, imagine if one of the first-class outputs of the Unicode Consortium was standard libraries for different human languages for different computer languages.
As a person who comes from a country with non-ASCII alphabet, I strongly disagree. Since UTF-8 became de-facto standard everywhere, so many headaches went away.
That complexity comes from the fact that you are using non-ASCII characters. UTF-8 is a superset of standard ASCII. If you are using only standard ASCII characters, they're exactly the same thing.
And you're naïve if you think ASCII suffices for English. I wouldn't give you ½¢ for an OS incapable of handling Unicode and UTF-8 even if you told me every language other than English were mysteriously destroyed. Going back to ASCII is 180° from what would enrich English-language text.
> You only need one sentence to explain why ASCII isn't sufficient
Nitpick: ASCII is sufficient when you consider that Base64, despite its 33% overhead from representing 6 bits with 8 bits, makes life easier for certain classes of software.
What I was alluding to is, I often convert any binary data, including text, to Base64 to avoid dealing with cross platform, cross language, cross format, cross storage, cross network data-handling. Only the layer that needs to deal with the blob's actual string representation needs to worry about encoding schemes that are outside the purview of the humble ASCII table.
Base64 encodes sextets. The mapping from octets to sextets is mostly settled for sets of three octets at a time, but the situation for input lengths that aren't a multiple of three octets is a mess.
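A quick look at that mess in Python: one, two or three input octets all occupy a single four-character output group, with '=' padding marking the unused positions.

    import base64

    for data in (b"a", b"ab", b"abc"):
        print(data, base64.b64encode(data))
    # b'a'   b'YQ=='
    # b'ab'  b'YWI='
    # b'abc' b'YWJj'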
ASCII is English and limiting access to knowledge for the rest of humanity for a simpler encoding is just not an acceptable option. Someone needs to interpret those 7k words and write a (complicated?) program once so that billions can read in their own language? Sounds like an easy win to me.
Sure, spoken, but both Arabic and CJK ideograms are written in far more countries in the world, by far more people, and for far longer in history than the ASCII set. The oldest surviving great works of mathematics were written in Arabic and some of the oldest surviving great works of poetry were written in Chinese, as just two easy and obvious examples of things worth preserving in "plain text".
Playing the devil's advocate here. I am not a native English speaker, I'm a French speaker, but I'm happy that English is kind of the default international language. It's a relatively simple language; I actually make fewer grammar mistakes in English than I do in my native language. I suppose it's probably not a politically correct thing to say, the English being the colonists, the invaders, the oppressors, but eh, maybe it's also kind of a nice thing for world peace if there is one relatively simple language that's accessible to everyone?
Go ahead and make nice libraries that support Unicode effectively, but I think it's fair game, for a small software development shop (or a one-person programming project), to support ASCII only for some basic software projects. Things are of course different when you're talking about governments providing essential services, etc.
I know almost no one who actually types the accented e, let alone the c with the cedilla. I scarcely ever see the degree symbol typed. Rather, I see facade, cafe, and "degrees".
That aside, the big problem with Unicode is not those characters; they're a simple two-byte extension. They obey the simple bijective mapping of binary character <-> character on screen. Unicode as a whole doesn't. You have to deal with multiple code points representing one on-screen grapheme, which in turn may or may not translate into a single on-screen glyph. Also bi-directional text, or even vertical text (see the recent post about Mongolian script). Unicode is still probably one of the better solutions possible, but there's a reason you don't see it everywhere: it means not just updating to wide chars but having to deal with a text shaper, redo your interfaces, and handle tons of other messy stuff. It's very easy for most people to look at that and ask why they'd bother if only a tiny percentage of users use, say, vertical text.
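A concrete illustration of the "multiple code points, one grapheme" point, using Python's unicodedata module:

    import unicodedata

    # "é" as one precomposed code point vs. "e" plus a combining accent:
    # they render identically but are different code point sequences until
    # normalised.
    a = "\u00e9"      # é, precomposed
    b = "e\u0301"     # e + COMBINING ACUTE ACCENT
    print(a == b)                                 # False
    print(len(a), len(b))                         # 1 2
    print(unicodedata.normalize("NFC", b) == a)   # True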
The first point is just because of the keys on a keyboard.
I see many uses of "pounds" or "GBP" on HN. Anyone with the symbol on the keyboard (British and Irish obviously, plus several other European countries) types £. When people use a phone keyboard, and a long-press or symbol view shows $, £ and €, they can choose £.
Danish people use ½ and § (and £). These keys are labelled on the standard Danish Windows keyboard.
There's plenty of scope for implementing enough Unicode to support most Latin-like languages without going as far as supporting vertical or RTL text.
For some reason people seem to think that the only options are UTF-8 and ASCII. That choice never existed. There are thousands upon thousands of character encodings in use. Before Unicode every single writing system had its own character encoding that is incompatible with everything else.
You didn't say spoken by every person. Merely spoken in every country. Even the existence of tourists in a country would pass this incredibly low bar...
Of course ASCII is simpler than Unicode; it handles only 128 characters. If you restrict yourself to those characters, ASCII is binary-equivalent to UTF-8.
So yeah, maybe you shouldn't use characters 128+ for data archival, I doubt that's a good idea, but that's irrelevant to whether UTF-8 is plain text or not.
I think that sometimes it makes sense to enforce strict limitations early on (eg: overly strict input validation). You can then remove such limitations in later versions of your software, after careful consideration and after inserting the necessary tests. The reverse usually doesn't work. If you didn't have those limitations early on, and your database is full of strings with characters that should never have been allowed in there, you will have a hard time cleaning up the mess.
This seems especially true to me in the design of programming languages. If you have useless, badly thought out features in your programming language, people will begin to rely on them, and you will never be able to get rid of them... So start with a small language, and make it strict. Grow it gradually.
There are tens of thousands of characters in all the human scripts. If you're a librarian, scholar, researcher -- why would you not want to be able to use them seamlessly??
If there was a complicated tool that claimed it could do the job of every tool in history, or a simple tool that was focused to cover 99% of the work you do-- and we lived on planet earth-- which would you choose?
As I understand it, it's impossible to have a txt file that uses Japanese and Chinese characters at the same time. The file will either use the Chinese or Japanese forms of the characters, depending on your font. I would think this is a big gotcha people must run into all the time, but I never hear anyone talk about it.
I’m not going to try and minimize the problem, here. Han unification was pushed through by western interests, by my understanding.
However, most Unicode characters are identical or nearly identical in Chinese and Japanese. Characters with “significant” visual differences got encoded as different Unicode characters. The same thing applies to simplified and traditional Chinese characters.
So for a given “Han character”, there might be between one and three different Unicode characters, and there might be between one and three different ways of writing it.
So the issue does come up when mixing Chinese and Japanese text, but it's not really one that has a big impact on the legibility of the text. You would definitely be concerned, though, if you were writing a Japanese textbook for Chinese students, or vice versa.
Beyond that, it is usually fairly trivial to distinguish between Japanese and Chinese text, so you could just lean on simple heuristics to get the work done (Japanese text, with the exception of fairly ancient text or very short fragments, contains kana, but Chinese does not).
Han unification was pushed through by western interests, by my understanding.
Note that as far as I'm aware, the interest in question was the initial 16-bit limit of the character set and later on the non-proliferation of competing standards.
Also note that while Han unification is the most prominent example, there are technically similar cases, which just aren't as charged culturally. For one, Unicode doesn't encode German Fraktur: While some characters are available due to their use in mathematics, it's lacking the corresponding variants of ä, ö, ü, ß, ſ as well as specific ligatures. So if you want to intermix modern with old German writing, you'll also have to go out-of-band.
Let's not excuse the utter irresponsibility of deciding on 16 bits: the initial 16-bit limit of the character set is instantly invalidated by looking at any comprehensive Chinese character dictionary, no reasonable choice of which will give you an estimate of under about 30k characters, even excluding graphical variants.
Even assuming that we discount 80k+ estimates by collapsing graphical variants, that's over half of your code space right off the bat. For this to seem like a good idea, you'd need to assume that Chinese is a uniquely bad one-off case. Not a good bet to stake your character set on.
It's actually exactly the same thing. The Han Unification didn't smash together unrelated squiggles that just happened to look similar, they were semantically the same - scholars of the Han writing system spent a bunch of time deciding what is or is not the same squiggle just drawn differently, like Fraktur, and today people are annoyed because, as you'd expect some of them believed that "style of fonts" was integral to the meaning anyway.
Chinese characters represent the Chinese words or parts thereof, Japanese ones represent Japanese words and parts thereof. That is a semantic difference.
So what you're saying is that because 'chat' in English and 'chat' in French are quite different words with very different meanings, you believe there should be a separate letter 'c' for English and French to enable us to tell those words apart?
It is not logographic, but characters still have meaning - associated phonemes. Although this is less clear in English, it is emphasized in other languages.
And this mapping is different between languages. So 'c' in English has different meaning to 'c' in Czech.
There are differences as well as similarities. I'm no expert, but shouldn't, say, U+4ECA still translate to 'now' no matter if you draw a particular line horizontally or diagonally? There are also some mandatory[1] ligatures in Fraktur unavailable in Unicode. What if I wanted to preserve that distinction in historic writing?
edit:
[1] I think the mandatory ones are actually there (just not in Fraktur), it's some optional ones like ſch that are missing.
> There are differences as well as similarities. I'm no expert, but shouldn't, say, U+4ECA still translate to 'now' no matter if you draw a particular line horizontally or diagonally?
No, since "now" is an English word, not a Japanese or Chinese one.
> There are also some mandatory[1] ligatures in Fraktur unavailable in Unicode.
Unicode doesn't encode ligatures except for backwards compatibility.
Of course it is. Ligatures aren't characters, they're glyphs that represent multiple characters. Unicode does not encode glyphs, that's simply not its job. No more than encoding what font to use or when to render text in italic.
Which is the whole point of Han unification, the argument being that whether or not a particular line in U+4ECA is horizontal or diagonal is just like that. What's the difference?
To the contrary: What any line in any glyph looks like is of no concern because Unicode doesn't deal with glyphs. It deals with abstract characters that don't have appearances to begin with.
"Α" and "A" look exactly the same (at least in most fonts). But each has its own code point because the GREEK CAPITAL LETTER ALPHA simply isn't the LATIN CAPITAL LETTER A or any other Latin letter.
As I understand it Han unification happened because at the time all there was was UCS-2 -no UTF-16, no UTF-8- so codespace was tight and precious, and that motivated codespace preserving optimizations, of which Han unification is the notable one.
To avoid that they needed to have invented UTF-8 many years earlier. Perhaps if the people designing Unicode were more diverse they might have felt the necessity of inventing UTF-8 strongly enough to actually do it, but then perhaps they might have done it poorly. At any rate, I don't know enough details to really know if "Han unification was pushed through by western interests" is remotely fair.
UTF-8 was sketched on a placemat as a response to a different idea. It seems likely that had it not arisen in a moment of inspiration by a genius, we would be stuck with another inferior design by committee.
I agree. But too, necessity is the mother of invention. GP seems to argue that Han unification happened because the UC was not diverse enough. Maybe, and maybe if it had been diverse enough the need would have arisen sooner. But again, the thing they came up with could have been garbage, who knows!
What I do know is that UTF-8 is genius. The Han unification problems seem mostly minor -- I suspect code can detect language and do the right thing, for example, and again, we could revive language tags if need be.
Here's some text I could write about some Japanese characters, that, thanks to Han Unification, may be confusing:
In 1946, the Japanese government created a (non-exhaustive) list of common characters, some of which were simplified from their more traditional form. One of them is 臭. Its older form was 臭. Another character that shares the same root, 嗅, was not part of that list of common characters. It was added later, in 2010, and was never simplified, such that the stroke that was removed in 臭 is still there, making it just slightly different.
If your fonts are biased towards Chinese, 臭 and 臭 will be identical, and you won't know what I'm talking about. The former is 自 above 大, the latter is 自 above 犬.
You could think the difference is trivial, but 大 is big and 犬 is dog. Not that it alters the meaning of 臭, 臭, or 嗅, but when talking about how 嗅 is not 口 alongside 臭 anymore, it does make a difference.
Yes, the real problem is when you start mixing all four (or five) of them together: Traditional Chinese, Simplified Chinese, Korean, Japanese. Things become extremely problematic.
I think it is by luck that all four writing systems have significant usage within their own regions. Imagine if one of them were significantly smaller and over time were forced (by ease of use or whatever reason) to switch to a different style without knowing it.
First of all, there is no new unification work ongoing. The Unicode Consortium moved on from that by moving on from UCS-2. UCS-2 drove unification as a way to preserve precious codespace.
There used to be language tag codepoints for this, but they've been deprecated. Han unification is an accident of history: a result of UTF-8 not having existed until it was too late!
There's not going to be a different new Unicode for doing away with Han unification, which is why no one mentions it: besides crying about it, what else can one do? Maybe we should revive language tags?
Anyways, isn't the difference between unified Han/Kanji characters mostly stylistic rather than semantic? I'm not denying that many users would get annoyed, but again, what to do about it??
It's different enough that users will immediately complain if you get it wrong. And it means that you, as a developer who might not understand either Chinese or Japanese, now have to deal with the fallout by setting a different font in your application depending on which of the two languages it is.
This happened to us in Factorio, and it was super annoying, because it's really hard to spot the problem before it goes live: you A: don't know the problem exists (how would you?), and B: have a hard time seeing it even when you do know.
The whole point of Unicode is to not have to think about this crap or handle it explicitly, and this breaks that guarantee fantastically.
See the whole point about history. All you're doing is crying about it :(
Here's a question: when a native Chinese speaker reads a Japanese text, do they want to see it in Chinese style or Japanese style? If the former, then just know that that's their preference and always use their preference -- easy fix. If the latter... you need to know the language of a text (or sub-text), and that requires either language tags or language recognition.
I expect it's the latter, to make it easier to recognize foreign text, which is not necessarily easy to read. After all, native Chinese, Japanese, and Korean speakers who don't speak the other languages can only glean so much meaning from Han/Kanji text in the others' languages. That's because while often ideographic characters are used for (common) meaning, sometimes they are used for the sounds of the words they identify but not their meanings.
In the original language, of course. Why is that even a question? That is like asking whether Greek people would want to read Latin script using the Greek alphabet or not.
You keep using the word 'style', so you agree that α is a style of a? Then I have no more comment. It's not 'style' at all.
The same could be said about whether è and é should be the same as e with different fonts. People who care about it would complain. To those who only use English, it is just the same e.
Not just different pronunciations. ê isn't about pronunciation but about indicating that "there used to be an 's' after this e". French written w/o circumflex accents doesn't change in pronunciation, not really, nor -mostly- in meaning, but it does look very annoying to French speakers as well as to native speakers of other Romance languages: the reason is that that reminder is in fact useful for translation.
I'm guessing Han unification is at least annoying like losing circumflex accents would be.
Relatively few people frequently look at different Han languages, and relatively few people are looking at txt files containing Han characters (and I expect those that do are typically running with their OS locale set to one of the Han languages?).
Enough CJK HTML content is tagged and heuristics are mostly good enough that incorrect font selection isn't a massive issue on the web, and AFAIK most major word processors include metadata in the file that suffices to distinguish language.
> Although utf8 is currently an alias for utf8mb3, at some point utf8 will become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.
They probably couldn't even if they wanted to, by this point there will be too much software out there depending on "utf8" meaning "MySQL's weird proprietary hacked-up version of UTF-8".
The only real solution is to hammer home the message that "utf8mb4" is what you put into MySQL if you want UTF-8.
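The underlying distinction, in a couple of lines of Python: MySQL's legacy "utf8" (utf8mb3) stores at most three bytes per character, and anything outside the BMP needs four.

    print(len("中".encode("utf-8")))    # 3 -> fits utf8mb3 or utf8mb4
    print(len("🙂".encode("utf-8")))    # 4 -> fits utf8mb4 only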
> For instance, ‘ch’ is two letters in English and Latin, but considered to be one letter in Czech and Slovak.
Is "ch" really considered one _character_ in Czech and Slovak? I'm Polish and we do have "ch" and consider it one ... sound... represented by two letters? I mean... if you asked anyone to count letters/characters in a word, they would count "ch" as two. So I wonder if that's different in Slovakia or Chech Republic, or is just my definition of "character" wrong.
> So I wonder if that's different in Slovakia or Chech Republic, or is just my definition of "character" wrong.
According to wikipedia, "Ch" is a character of the Czech alphabet in the sense that it impacts alphabetical ordering ("Ch" sorts between H and I), in the same way Ł or Ę are apparently characters from the Polish alphabet distinct from L and E respectively (wikipedia mentions that "być comes after bycie").
That is unlike, say, French, where É and E are the same character alphabetically.
This depends on your definition of informal terms like "letter", "character" etc.
The typographic term for combinations like this is "digraph". (Wikipedia's definition: "A digraph [...] is a pair of characters used in the orthography of a language to write either a single phoneme [...] or a sequence of phonemes that does not correspond to the normal values of the two characters combined".)
Whether digraphs have separate keys on a keyboard, are treated as distinct for the purposes of alphabetisation, whether speakers of the language think of them as separate "letters" when spelling out a word and so on, are all separate issues and varies between languages (or, more precisely, between the conventions for writing a certain language).
A better example would probably be "ij" in Dutch. That's definitely considered a single letter, as words starting with ij in Dutch are capitalised IJ. Though there are glyphs for IJ /ij already in unicode.
"Ij" is also one sounds represented bij two letters, and I think capitalizing just the 'I' is pretty standard. As a Dutch person myself, I didn't even know that there's a glyph for it!
We also have "ei", which sounds the same and was invented to annoy people learning Dutch. Then there's "oe", "eu", "ui". And just to fuck even more with people learning the language, we have "au" and "ou" which also sound the same. Oh, and "ch" and "g".
Hans Brinker, the inventor of the Dutch language, famously would toss a florijn to decide between using ei/ij and au/ou, as he was not fond of foreigners. He's mostly known for saving our country though when he plugged a hole in a dyke with his finger (yes, I know what you're thinking, and no, we do not appreciate your dirty minds making light of this heroic act).
Interesting. I never really gave it much thought, but Ij actually bothers me so much that I usually try to avoid using it at the beginning of a sentence, and I cringe when I need to capitalize because it's a place (like Ijsselmeer).
Just did some googling. Turns out that unlike the other combinations, capitalizing both letters is mandatory for 'IJ'. TIL...
Nobody has that as a letter on the keyboard here though, so it doesn't matter. It's normally typed as a digraph. Would be nice if we just switched over to using y at this point. Makes me wonder, has the use of diacritics been declining since ASCII keyboards became the norm?
"IJ (lowercase ij; Dutch pronunciation: [ɛi]) is a digraph of the letters i and j. Occurring in the Dutch language, it is sometimes considered a ligature, or a letter in itself. In most fonts that have a separate character for ij, the two composing parts are not connected but are separate glyphs, which are sometimes slightly kerned."
I don't know that that's correct. That there exists a ligature character doesn't mean the ligature is a character of the language.
It could, mind, I don't know Dutch. But in French "œ" (which has a ligatured character as you can see) is canonically equivalent to "oe". It is not a separate letter of the alphabet even though:
* many words should not be written with the ligatured form
* many words should be written with the ligatured form
* it has a different pronunciation than the base form
Based on my experience learning Czech (not native at all, just interested):
- it's typically listed as a separate letter when writing out the alphabet
- but in practice it's typed out as "c h" and not as a single character
- it occupies its own place in Czech standard alphabetical order, my English-Czech dictionary has all the "ch" words after "h" (so interestingly in order to do a proper sort programmatically you need to possibly look 2 characters ahead)
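That two-characters-ahead lookahead is what locale-aware collation does for you. A quick check from Python, assuming the cs_CZ.UTF-8 locale is installed:

    import locale

    # Czech collation treats "ch" as its own letter, sorting after "h".
    locale.setlocale(locale.LC_COLLATE, "cs_CZ.UTF-8")
    words = ["chata", "cena", "hora"]
    print(sorted(words))                      # ['cena', 'chata', 'hora']  (naive)
    print(sorted(words, key=locale.strxfrm))  # ['cena', 'hora', 'chata']  (Czech)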
As a native Czech speaker, I never really understood what it means that 'ch' is one letter in Czech. It is clearly two graphemes representing one phoneme, so one could think it is a digraph, but it has some special properties, like being one element in the collating order. I think people just started to call it one letter to have the one-letter-one-sound property.
At first I thought they simply meant the letter "č", but no, it turns out that "ch" (and also "dz") is a digraph with a separate place in the Czech and Slovak alphabets.
I came to the same conclusion years ago. My app is Win32, but I never defined UNICODE or used the TCHAR abomination. All strings are stored as UTF8 until they are passed to Win32 APIs, whereupon they are converted to UCS-2. I explicitly call the wchar version of functions (ex: TextOutW). This strategy enabled me to transition easily and safely from single-byte ASCII (Windows 3.1) to Unicode.
Calling the "A", instead of "W" functions might be some small perf hit (don't know if it matters), but for some functionality you need to call the "W" functions, for example to break the limit of 256 or was it 260 characters, up to 32768 (or was it 16384).
Is java.lang.String still UTF-16? Is there any plan to fix that? Once Windows and Java take care of it, I can't think of any other major UTF-16 uses left. Are there any that I've forgotten about?
I don't think they can fix that without completely breaking backwards compatibility. The basic char type in Java is defined as a 16 bit wide unsigned integer value and String doesn't abstract over that.
Only for ASCII text. There is still no UTF-8 support (it's even called out as a non-goal in the JEP: "It is not a goal to use alternate encodings such as UTF-8 in the internal representation of strings.")
I don't think it's a big deal for Java because it's always easy to transfer in from and out to UTF-8. Very few Java programs use UTF-16 as a persistence format, and Java-native applications can directly marshal strings around as they are a first-class datatype.
You’re right! I’m surprised I didn’t know that. It looks like it can also be UCS-2, going by the spec:
> A conforming implementation of this International standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it is presumed to be the UTF-16 encoding form.
> A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it presumed to be the UTF-16 encoding form.
> A String value is a member of the String type. Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text. However, ECMAScript does not place any restrictions or requirements on the values except that they must be 16-bit unsigned integers.
That's not really possible as strings are defined in terms of char and guarantee O(1) access to UTF16 code units. They might try to switch to "indexed UTF8" (as pypy did in the Python ecosystem whereas "CPython proper" refused to switch to UTF8 with the Python 3 upheaval and went with the death trap that is PEP 393 instead).
However it's not quite unequivocal. Windows still uses UTF-16 in the kernel (or actually an array of 16bit integers, but UTF-16 is a very strong convention). The code page will often allow the Win32 API to perform the conversion back and forth instead of your application doing it.
AFAICT, it's not only "internal representation". .NET strings are defined as a sequence of UTF-16 units, including the definition of the Char type representing a single UTF-16 code unit. I can't imagine how such a change could be implemented (other than changing the internal representation but converting on all accesses which would be nonsense, I think).
Basically WTF-16 is any sequence of 16-bit integers, and is thus a superset of UTF-16 (because UTF-16 doesn't allow certain combinations of integers, mainly surrogate code points that exist outside of surrogate pairs).
Then WTF-8 is what you get if you naively transform invalid UTF-16 into UTF-8. It is a superset of UTF-8.
This is very useful when dealing with applications like Java and Javascript that treat strings as sequences of 16-bit code points, even though not all such strings are valid UTF-16.
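You can poke at the distinction from Python: a lone surrogate is a perfectly storable 16-bit code unit, but it isn't valid UTF-16/UTF-8, and the surrogatepass error handler produces essentially the WTF-8 bytes for it.

    s = "\ud800"                 # a lone (unpaired) high surrogate
    try:
        s.encode("utf-8")        # strict UTF-8 refuses it
    except UnicodeEncodeError as e:
        print("not valid UTF-8:", e.reason)
    print(s.encode("utf-8", "surrogatepass"))   # b'\xed\xa0\x80'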
> Basically WTF-16 is any sequence of 16-bit integers, and is thus a superset of UTF-16 (because UTF-16 doesn't allow certain combinations of integers, mainly surrogate code points that exist outside of surrogate pairs).
If WTF-16 is the ability in potentia to store and return invalid UTF-16 without signalling errors, I don't know that there's any actual UTF-16 system out there, with the possible exception of… HFS+ maybe?
That's good news. Last time I looked, more than a decade ago admittedly, that bug was WONTFIX.
In fact I was so surprised I just wrote a test program. They have fixed it!
It was the dumbest bug I ever saw in Windows. It was special case code in the console output code path of the user mode part of WriteFile. It only existed to make utf8 work, and it didn't even do that.
Ah, that's surprising, Microsoft was very stubbornly not doing that for at least a decade and a half.
In fact, the FAQ in TFA (questions 9 and 20) mentions that there are still problems with CP_UTF8 (65001). Is the article out of date? Can someone respond to those statements?
The article is outdated; it's from 2012. Not only did they fix the problems, but in Windows 10 1803 they also added an option to globally and permanently set both the OEM and ANSI(!) codepages to 65001.
It can be enabled by checking the "Beta: Use Unicode UTF-8 for worldwide language support" checkbox in region settings.
When I used to do a lot of windows programming in the late 90s, I wish that I had a sensible guide like this for handling strings. TCHAR was always a source of subtle bugs.
I suppose, though, that the underlying problem was that Microsoft was so late to ship a compatibility solution for Windows 9x. Most software of the time ended up targeting the "ANSI" multibyte character set (MBCS) simply because otherwise you would need to either ship two executables or do your own thunking. This solution would be a double thunk on 9x, because you'd be thunking your UTF-8 to Unicode and then thunking that back to MBCS.
> When writing a UTF-8 string to a file, it is the length in bytes which is important. Counting any other type of ‘characters’ is, on the other hand, not very helpful.
So, suppose I have a UTF-8 string of n code units (bytes) length. Unfortunately my data structure only permits strings of length m < n bytes.
How do I correctly truncate the string so it doesn't become invalid UTF-8 and won't show any unexpected gibberish when rendered? (E.g., the truncated string doesn't suddenly contain any glyphs or grapheme clusters that weren't in the original string)
> How do I correctly truncate the string so it doesn't become invalid UTF-8 and won't show any unexpected gibberish when rendered? (E.g., the truncated string doesn't suddenly contain any glyphs or grapheme clusters that weren't in the original string)
Cropping strings is a hard problem for ASCII strings as well. It can even be a security problem if the cropped part contains important information that alters the meaning of the first part (something like "DELETE FROM table_name [WHERE condition]", or natural language where the cropped part is a condition or a negation).
But even if you don't care about this: if you care about cropping visually nicely, you want some ellipsis at the end, you don't want to crop in the middle of a word (if possible), etc. In the end, you need some nice text processing anyway.
Avoiding invalid UTF-8 is easy, almost trivial: just make sure you don't truncate in the middle of a code point.
The latter is fiendishly difficult to get right in all cases, the ugliest case being emoji flags. Being all-or-nothing on both sides of a ZWJ will get you most of the way there, however.
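For the easy half (staying valid UTF-8), a boundary-safe truncation might look like the sketch below (the helper name is mine; it deliberately does nothing about grapheme clusters or ZWJ sequences):

```cpp
#include <string>

// Sketch: truncate a UTF-8 string to at most max_bytes without splitting a
// code point. Continuation bytes have the form 10xxxxxx, so we back up past
// them until we hit a lead byte. The result stays valid UTF-8, but this does
// NOT prevent splitting a grapheme cluster (e.g. a flag or ZWJ emoji sequence).
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t end = max_bytes;
    while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80)
        --end;  // don't cut in the middle of a multi-byte sequence
    return s.substr(0, end);
}
```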
What do you think distinguishes the truncated string containing "glyphs or grapheme clusters" that weren't in the original string from the truncated string containing words that weren't in the original string? Is the latter somehow more acceptable? How about missing necessary context from the end of a sentence?
Refuse to accept a string that is too long, and require an interactive user (hopefully one literate in the language) to truncate it for you. In a non-interactive context, you can't.
As someone who experienced serious pain with broken strings, sometimes discovered only after the original files were gone and new special characters had been introduced, I directed quite some anger at the fact that computer systems are internally operated mostly in English, so usually nobody notices bugs with wrong character encodings. So I share the sentiment of the article.
I do not want to think about UTF encodings when I simply create a 7z or tar file, without even programming. But I learned the hard way that I had to. I never even found out, for example, whether it was (or is) a bug in 7z, tar, rsync, the SciTE text editor, Notepad++, or just wrong usage/configuration. I just had (and still have, even now that my workflow is clean) a special first file/code line with special characters that I checked for correctness after compressing and rsyncing between different systems, especially between Windows and Linux. But it probably helps that I don't have to do that anymore.
> Many third-party libraries for Windows do not support Unicode: they accept narrow string parameters and pass them to the ANSI API. Sometimes, even for file names. In the general case, it is impossible to work around this, as a string may not be representable completely in any ANSI code page (if it contains characters from a mix of Unicode blocks). What is normally done by Windows programmers for file names is getting an 8.3 path to the file (if it already exists) and feeding it into such a library. It is not possible if the library is supposed to create a non-existing file.
Yikes. That's a fascinating use of 8.3 paths. Sometimes when I look at really old Windows cruft I wonder when it will go away. 8.3 paths seemed like an easy thing to get rid of, but with 8.3 paths used to hack around encoding issues in 3rd party libraries... that's going to stick around...
Anyone know which libraries this is talking about?
> Q: What do you think about Byte Order Marks? A: According to the Unicode Standard (v6.2, p.30): "Use of a BOM is neither required nor recommended for UTF-8". [...] Using BOMs would require all existing code to be aware of them, even in simple scenarios as file concatenation. This is unacceptable.
Then your site "UTF-8 everywhere" is misnamed, because standards-following UTF-8 can have a BOM.
It's not required or recommended, but it is possible and allowable, so you might see them, and if you follow the standard you have to deal with them. It's not a matter of "this would require all existing code to handle them": that is not hypothetical, that is the current world; to be standards-compliant, all existing code already needs to be aware of them. It isn't, which means it's broken. Declaring it "unacceptable" is meaningless, except to say you're rejecting the standard and doing something incompatible and broken because it's easier.
Which is a position one can take and defend, but it's not a good position for a site claiming to be pushing for people to follow the standard. What it is, is yet another non-standard ad-hoc variant defined by what some subset of tools the authors use can/can't handle in April 2020.
> "the UTF-8 BOM exists only to manifest that this is a UTF-8 stream"
Throwing the word "only" in there doesn't make it go away. It exists as a standards-compliant way to distinguish UTF-8 from ASCII, not recommended but not forbidden.
> "A: Are you serious about not supporting all of Unicode in your software design? And, if you are going to support it anyway, how does the fact that non-BMP characters are rare practically change anything"
Well, in the same way, how does the fact that UTF-8+BOM is rare practically change anything? At some level, you're either pushing for everyone to follow standards even when it's inconvenient, because that makes life better for everyone overall (as you are with surrogate pairs and indexing), or you're creating another ad-hoc incompatible variation of UTF-8 which you prefer to the standard and trying to strong-arm everyone else into using it, with threats of being incompatible with all the code that already does it wrong.
Being wary of Chesterton's Fence, presumably there's some company or system which got UTF-8+BOM added to the standard because they wanted it, or needed it.
> using BOMs would require all existing code to be aware of them, even in simple scenarios as file concatenation
Absolutely! Any app that writes UTF files can (and probably should) avoid writing a BOM. But any program that reads UTF files must handle a BOM. A lot of apps write UTF-8 including the BOM by default, Visual Studio for example.
You can NOT concatenate two UTF-8 streams and expect that the resulting stream is also a valid UTF-8 stream. NO tool should assume that, ever.
> You can NOT concatenate two UTF-8 streams and expect that the resulting stream is also a valid UTF-8 stream.
Actually you can; the ability to concatenate UTF-8 streams is an intentional part of the design of UTF-8. The BOM is an ordinary Unicode code point and can occur in the middle of a valid UTF-8 stream, where it should be treated as either a zero-width non-breaking space or an unsupported character (which only affects rendering). So concatenating two UTF-8 streams with leading BOMs still results in a valid UTF-8 stream, albeit with an extra zero-width non-breaking space in the middle.
The bigger problem with the BOM is that it breaks transparent compatibility with ASCII. Absent a leading BOM character, a UTF-8 stream containing only codepoints 0-127 is binary-identical to an ASCII-encoded text stream and can be handled with tools that are not UTF-8 aware. This was an explicit design consideration for both Unicode and UTF-8. Add the BOM, however, and your file is no longer plain text, which can lead to syntax errors or other issues that are difficult to diagnose because the BOM is invisible in UTF-8 aware text editors.
I think the BOM was a mistake—along with the variable-length multi-byte encodings it was created to support—but unfortunately at this point we're stuck with it. (Actually the BOM is prohibited in the multi-byte formats with an explicit byte order, like UTF-16BE; it would have been really nice if the same policy had been applied to UTF-8 where byte order is irrelevant.) The best we can do is recommend that new programs omit the BOM when outputting UTF-8 and either skip it at the beginning or convert it to U+2060 WORD JOINER anywhere else when it appears in the input.
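Skipping a leading BOM on input is only a few lines; a minimal sketch (the helper name is mine):

```cpp
#include <string>

// Sketch: tolerate (and drop) a leading UTF-8 BOM (bytes EF BB BF) when
// reading input. Writers are better off omitting it, but readers still
// have to cope with it.
std::string strip_utf8_bom(const std::string& s) {
    if (s.size() >= 3 &&
        static_cast<unsigned char>(s[0]) == 0xEF &&
        static_cast<unsigned char>(s[1]) == 0xBB &&
        static_cast<unsigned char>(s[2]) == 0xBF)
        return s.substr(3);
    return s;
}
```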
Interesting, I thought a BOM-in-the-middle was invalid. I know apps are even more likely to choke on that than a leading BOM though.
In any case, you need to handle it in every app that claims to read UTF. The loss of compatibility is indeed the biggest problem and I agree the BOM should be omitted when possible, but that doesn’t change that it’s part of the spec and millions of UTF files have a BOM.
Even if 100% of all apps stopped using a BOM today you couldn’t ignore it in a parser.
Downvoting doesn't make the BOM stop being part of the standard either, btw.
Yes, supporting the BOM on arbitrary UTF-8 streams varies between difficult and impossible, but then get it removed from the standard, or state that you don't support the standard. Don't pretend you support the standard while ignoring the bits you don't like; that's dishonest and unhelpful.
I'd argue for some standard tests for UTF-8 strings:
- Basic - UTF-8 byte syntax correct.
- Unambiguous - similar to the rules for Unicode domain names. The rules are complicated, but basically they prohibit homoglyphs, mixing glyphs from different character sets, forwards and backwards modifiers in the same string, emoji, modifiers, etc. Use where people have to visually compare two things for identity or retype them, such as file names.
- Unambiguous, light version - as above, but allow emoji and modifiers. Normal form for documents.
Still doesn't solve the fact that filesystems across different OSes allow invalid UTF-8 sequences in filenames.
Maybe 99% of apps do not care, but even a simple "cp" tool should care. Filenames (and maybe other named resources) should be treated completely differently, and not blindly assumed to be UTF-8 compatible.
> 2) Bye bye backward compatibility and interoperability
It's already not really a thing.
Traditional unices allow arbitrary bytes with the exception of 00 and 2f, NTFS allows arbitrary utf-16 code units (including unpaired surrogates) with the exception of 0000 and 002f, and I think HFS+ requires valid UTF-16 and allows everything (including NUL).
The OS then adds its own limitations, e.g. Win32 forbids \, :, *, ", ?, <, >, | (as well as a few special names, I think), and OSX forbids 0000 and 003a (":"), the latter of which gets converted to and from "/" (and similarly forbidden) by the POSIX compatibility layer.
The latter is really weird to see in action, if you have access to an OSX machine: open a terminal, try to create a file called "/" and it'll fail. Now create one called ":". Switch over to the Finder, and you'll see that that file is now called "/" (and creating a file called ":" fails).
Oh yeah, and ZFS doesn't really care, but it can require that all paths be valid UTF-8 (by setting the utf8only flag).
> Traditional unices allow arbitrary bytes with the exception of 00 and 2f, NTFS allows arbitrary utf-16 code units (including unpaired surrogates) with the exception of 0000 and 002f.
For just Windows -> Linux you can represent everything by mapping WTF-16 to WTF-8.
It sounds like they're saying the opposite. All programs dealing with filenames need to be able to support an arbitrary stream of bytes, they can't just assume UTF-8.
1) Nope.
2) Yes, we need to keep backward compatibility.
What I'm saying is that promoting UTF-8 everywhere, without specifically stressing the fact that filesystems (in general) do not observe UTF-8, leads to API/library designs that lack good support there.
Path/filename/dirname/whatever should be a different kind of "string".
Backward compatibility is a laudable goal and is not to be broken lightly. But sometimes, things are so fundamentally broken that we would be far better off with a clean break.
Interoperability is quite possibly a good argument for coming up with some reasonable restrictions on filenames. Today you could easily create a ZIP file or similar (case-sensitive names, special characters, etc.) that cannot be successfully extracted on this platform or that.
In an excellent article, David A. Wheeler [1] lays out a compelling case against the status quo. TL;DR: bad filenames are too hard to handle correctly. Programs, standards, and operating systems already assume there are no bad filenames. Your programs will fail in numerous ways when they encounter bad filenames. Some of these failures are security problems.
He concludes: "In sum: It’d be far better if filenames were more limited so that they would be safer and easier to use. This would eliminate a whole class of errors and vulnerabilities in programs that “look correct” but subtly fail when unusual filenames are created (possibly by attackers)." He goes on to consider many ideas towards getting to this goal.
To me, that's a design flaw. Would we really be any worse off if we simply declared filenames must be UTF-8?
That seems to be the only case where a user-visible and user-editable field is allowed to be an arbitrary byte sequence, and its primary purpose seems to be allowing this argument to pop up on HN every month.
I've never seen any non-malicious use of it. All popular filesystems already disallow specific sets of ASCII characters in names. Any database which needs to save data in files by number has no problem using safe hex filenames.
Sure, we could declare that, but then what? Non-Unicode filenames won't suddenly disappear. Operating systems won't suddenly enforce Unicode. Filesystems will still allow non-Unicode names.
Simply declaring it doesn't help anybody. In the meantime, your application still needs to handle non-Unicode filenames, otherwise those malicious ones are free to be malicious.
I'd assume that the proper place to define what counts as a valid filename would be at the filesystem level, so a filesystem following standard ABC v123 would not allow non-Unicode names; non-Unicode filenames would either get refused or modified when copied or written to the filesystem.
This is not new; it would match the current behavior of the OS/filesystem enforcing other character restrictions, such as when writing a file name with an asterisk or colon to a FAT32 USB flash drive.
What is this C++ `narrow()/widen()` function mentioned in the Windows section? At the risk of asking to be spoonfed, can someone give the source code of a function that takes a UTF-8 `std::string` and gives a UTF-16 `std::wstring`?
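As far as I can tell, narrow()/widen() aren't standard library functions; they're small helpers the article expects you to write yourself (or take from a library such as Boost.Nowide) around the Win32 conversion APIs. A minimal widen() sketch using MultiByteToWideChar, not the article's exact code and with error handling reduced to a throw:

```cpp
#include <stdexcept>
#include <string>
#include <windows.h>

// Sketch of a widen() helper for Windows: convert a UTF-8 std::string to a
// UTF-16 std::wstring (wchar_t is 16-bit on Windows). MB_ERR_INVALID_CHARS
// makes invalid UTF-8 fail instead of being silently replaced.
std::wstring widen(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), static_cast<int>(utf8.size()),
                                  nullptr, 0);          // first call: measure
    if (len == 0) throw std::runtime_error("invalid UTF-8");
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &utf16[0], len);                // second call: convert
    return utf16;
}
```

The inverse narrow() is the same two-call dance with WideCharToMultiByte and CP_UTF8.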
> In the UNIX world, narrow strings are considered UTF-8 by default almost everywhere
I think in the Unix world, null-terminated strings are the default. They don't even need to be valid UTF-8. For display purposes, the shell uses the locale setting.
I love the typesetting on the page. It is content-first, clean, and simple.
It lacks all the usual noise like modal dialogs, headers and footers, social media icons, colorful sidebars, newsletter sign-ups, cookie warnings, etc.
I'd be happy if I could just get consistent encoding. I have to handle way too many files with mixed encodings, even XML files with an explicit encoding header.
It is a pain in the ass to have a variable number of bytes per char.
In ASCII, you could easily know every character personally. No strange surprises.
Also no surprises while reading black on white text and suddenly being confronted with clors [1].
[1] Also no surprises when writing a comment on HN like this one and having some characters stripped. I put in a smiley as the first "o" in colors, but it was stripped out. Looks like the makers of HN don't like UTF-8 either.
You're conflating code points and some encoding; more importantly, you're conflating "an array of encoded objects (bytes)" with "a string of text". They're not, and never have been, the same.
> It is a pain in the ass to have a variable number of bytes per char.
Maybe, but nobody can stomach the wasted space you get with UTF-32 in almost every situation. The encoding time tradeoff was considered less objectionable than making most of your text twice or four times larger.
And as the article points out, even then you might have more than one code point for a character.
> For example, the only way to represent the abstract character ю́ cyrillic small letter yu with acute is by the sequence U+044E cyrillic small letter yu followed by U+0301 combining acute accent.
You can't even write proper English in ASCII. ASCII is an absolute dead end. It's history.
Actually representing human language is HARD. It is also absolutely necessary. Whatever solution you choose is going to be complicated, because it is solving a very complicated problem.
Throwing your hands up and going "oh this is too hard, I don't like it" will get you nowhere.
ASCII doesn't have a direct representation of all the punctuation used in English print, like curly “66”/“99” quotes and the different kinds of dashes (distinct from the minus sign). For non-print, it's entirely fine.
Typesetting should be handled by a markup language anyway. Adding a few characters to Notepad doesn't create a typesetting system. A typesetting system needs to be able to do kerning, ligatures, justification. Not to mention bold, italics, and different fonts.
> It is a pain in the ass to have a variable number of bytes per char.
This comes from API and language mistakes more than from an issue with UTF-8 itself.
If you actually design your API and system around being UTF-8, like Rust did, then there's really no issue for the programmer. The API enforces the rules and still gives you things like a simple character iterator (with characters being 32-bit, so a code point actually fits: https://doc.rust-lang.org/std/char/index.html). The String type handles all the multi-byte stuff for you; you never "see" it: https://doc.rust-lang.org/std/string/struct.String.html
Retrofitting this into existing languages isn't going to be easy, but that's not an excuse to not do it at all, either.
For parsing text-based formats, UTF-8 has the nice property that the encoded byte sequence of a character is never a substring of the encoding of any other character or sequence of characters. This means splitting on byte sequences of UTF-8 works just as well as splitting on code points.
And for text editing you need to deal with grapheme clusters anyway, which can be made up of a variable number of code points, so having those code points be made up of a variable number of bytes doesn't make anything worse.
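A quick illustration of the parsing point: splitting on a (possibly multi-byte) delimiter with plain byte-wise search is safe, because lead bytes and continuation bytes live in disjoint ranges, so a delimiter's bytes can never match inside some other character's encoding. The helper below is mine, just a sketch:

```cpp
#include <string>
#include <vector>

// Sketch: split a UTF-8 string on a UTF-8 delimiter using plain byte-wise
// search. A byte-wise match can only start at a real character boundary and
// only on a real occurrence of the delimiter, so the result is the same as
// splitting on code points.
std::vector<std::string> split_utf8(const std::string& text,
                                    const std::string& delim) {
    std::vector<std::string> parts;
    std::size_t start = 0, pos;
    while ((pos = text.find(delim, start)) != std::string::npos) {
        parts.push_back(text.substr(start, pos - start));
        start = pos + delim.size();
    }
    parts.push_back(text.substr(start));
    return parts;
}
// e.g. split_utf8("a→b→c", "→") yields {"a", "b", "c"},
// assuming the source and execution charsets are UTF-8.
```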
It's not as straightforward or sensible as you think. It's case-insensitive; it's case-preserving; and C0 control characters, SPC, and DEL are allowed. The case-differentiating bits for letters are nowadays sometimes used in an attempt to foil attackers. If you want things to look back on and say "I think that X was a mistake," then forget UTF of any stripe. The DNS is full of them.
This pops up every so often, and is wrong on several fronts (UNIX is UTF-8, UTF-8/32 lexicographically sort, etc.) There's not really a good reason to support UTF-8 over UTF-16; you can quibble over byte order (just pick one) and you can try and make an argument about everything being markup (it's not), but the fact is that UTF-16 is a more efficient encoding for the languages a plurality of people use natively.
But more broadly, being able to assume $encoding everywhere is unrealistic. Write your programs/whatevers allowing your users to be aware of and configure encodings. It might not be ideal, but such is life.
> There's not really a good reason to support UTF-8 over UTF-16
Two big reasons:
1. All legal ASCII text is UTF-8. That means upgrading ASCII to UTF-8 to support i18n doesn't require you to convert all your files that were in ASCII.
2. UTF-16 gives people the mistaken impression that characters are fixed-width instead of variable-width, and this causes things to break horribly on non-BMP data. I've seen amusing examples of this.
> Write your programs/whatevers allowing your users to be aware of and configure encodings.
Internally, your program should be using UTF-8 (or UTF-16 if you have to for legacy reasons), and you should convert from non-Unicode charsets as soon as possible. But if you're emitting stuff... you should try hard to make sure that UTF-8 is the only output charset you have to support. Letting people select non-UTF-8 charsets for output adds lots of complication (now you have to have error paths for characters that can't be emitted), and you need to have strong justification for why your code needs that complication.
> 1. All legal ASCII text is UTF-8. That means upgrading ASCII to UTF-8 to support i18n doesn't require you to convert all your files that were in ASCII.
Eh, realistically if you're doing this, you should be validating it like converting from one encoding to another anyway. I get that people won't and haven't, but that's because UTF-8 has this anti-feature where ASCII is compatible with it, and that's led to a lot of problems.
> 2. UTF-16 gives people the mistaken impression that characters are fixed-width instead of variable-width, and this causes things to break horribly on non-BMP data. I've seen amusing examples of this.
This is one of those problems, and it's way worse with UTF-8 because it encodes ASCII the same way ASCII does. It's let programmers stay naive about this stuff for... decades?
> Internally, your program should be using UTF-8 (or UTF-16 if you have to for legacy reasons), and you should convert from non-Unicode charsets as soon as possible.
There are all kinds of reasons to not use UTF-8. tialaramex pointed out one above. "UTF-8 everywhere" is simply unrealistic, and it forces a lot of applications to be slower, or to take on unnecessary complexity. Maybe it's worth it to "never have to think about encodings again", but that's pretty hard to verify and there's no way it happens in our lifetimes anyway.
> and you need to have strong justification for why your code needs that complication.
Yeah see, I strongly disagree with this. I'll choose whatever encoding I like, thanks. Maybe you don't mean to be super prescriptive here, but I think a little more consideration by UTF-8 advocates wouldn't hurt.
If everyone chooses whatever encoding they like, then the charset being used has to be recorded somewhere. The problem is, there are lots of places where the charset isn't recorded (such as your filesystem). That this is a problem can be missed, because almost all charsets are a strict superset of ASCII (in the top 99.99% of usage, UTF-7 and UTF-16 are the only ones that aren't), so it's only when you try your first non-ASCII characters that problems emerge.
Unicode has its share of issues, but at this point, Unicode is the standard for dealing with text, and all i18n-aware code is going to be built on Unicode internally. The only safe way to handle text that has even the remotest chance of being i18n-aware is to work with charsets that support all of Unicode, and given its compatibility with ASCII, UTF-8 is the most reasonable one to pick.
If you want to insist on using KOI-8, or ISO-2022-JP, or ISO-8859-1, you're implicitly saying "fuck you" to 2/3 of the world's population since you can't support tasks as basic as "let me write my name" for them.
> If everyone chooses whatever encoding they like, then the charset being used has to be encoded somewhere.
This is gonna be the case for the foreseeable future, as you point out. Settling on one encoding only fixes this like, 100 years from now. I'd prefer to build encoding-aware software that solves this problem now.
> given its compatibility with ASCII, UTF-8 is the most reasonable one to pick
This only makes sense if your system is ASCII in the first place, and if you can't build encoding-aware software. I think we can both agree that's essentially legacy ASCII software, so you don't get to choose anything anyway. And any system that interacts with it should be encoding-aware and still validate the encoding anyway, as though it might be BIG5 or whatever. Assuming ASCII/UTF-8 is a bad idea, always and forever.
> If you want to insist on using KOI-8, or ISO-2022-JP, or ISO-8859-1, you're implicitly saying "fuck you" to 2/3 of the world's population since you can't support tasks as basic as "let me write my name" for them.
I'm not obligated to write software for every possible user at every point in time. It's perfectly acceptable for me to say, "I'm writing this program for my 1 friend who speaks Spanish" and have that be my requirements. But if I were to write software that had a hope of being broadly useful, UTF-8 everywhere doesn't get me there. I'd have to build it to be encoding-aware, and let my users configure the encoding(s) it uses.
> But if I were to write software that had a hope of being broadly useful, UTF-8 everywhere doesn't get me there.
Actually, it does.
Right now, in 2020, if you're writing a new programming language, you can insist that the input files must be valid UTF-8 or it's a compiler error. If you're writing a localization tool, you can insist that the localization files be valid UTF-8 or it's an error. Even if you're writing a compiler for an existing language (e.g., C), it would not be unreasonable to say that the source file must be valid UTF-8 or it's an error--and let those not using UTF-8 right now handle it by converting their source code to use UTF-8. And this has been the case for a decade or so.
That's the point of UTF-8 everywhere: if you don't have legacy concerns [someone actively using a non-ASCII, non-UTF-8 charset that you have to support], force UTF-8 and be done with it. And if you do have legacy concerns, try to push people to using UTF-8 anyways (e.g., default to UTF-8).
I can't insist that other systems send your program UTF-8, or that the users' OS use UTF-8 for filenames and file contents, or that data in databases uses UTF-8, or that the UTF-8 you might get is always valid. The end result of all these things you're raising is "you can't assume, you have to check always, UTF-8 everywhere buys you nothing". Even if we did somehow get there, you'd still have to validate it.
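Validation, at least, is cheap and local either way. A sketch of a strict check (my code, not from the article or either commenter) that rejects truncated sequences, stray continuation bytes, overlong forms, surrogates, and values above U+10FFFF:

```cpp
#include <cstdint>
#include <string>

// Sketch of a strict UTF-8 validity check.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0, n = s.size();
    while (i < n) {
        unsigned char b = s[i];
        std::size_t len;
        std::uint32_t cp;
        if      (b < 0x80)           { ++i; continue; }      // ASCII
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;            // stray continuation or invalid lead byte
        if (i + len > n) return false;                       // truncated
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return false;            // bad continuation
            cp = (cp << 6) | (c & 0x3F);
        }
        if ((len == 2 && cp < 0x80) ||
            (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000)) return false;        // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;      // surrogate
        if (cp > 0x10FFFF) return false;                     // out of range
        i += len;
    }
    return true;
}
```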
> not really a good reason to support UTF-8 over UTF-16
Of course there is, the fact that if you're dealing only with ASCII characters then it's backwards-compatible. Which is a nice convenience in a great number of situations programmers encounter.
The minor details of encoding efficiency these days aren't particularly relevant: sure, UTF-16 is better for Chinese, but the average webpage usually has way more markup, CSS and JavaScript than text, and gzipping it on delivery will result in a similar payload pretty much independent of the encoding you choose.
UTF-8's ASCII compatibility is an anti-feature; it's allowed us to continue to use systems that are encoding naive (in practice ASCII-only). It's no substitute for creating encoding-aware programs, libraries, and systems.
The vast majority of text is not in HTML or XML, and there's no reason you can't use Chinese characters in JavaScript besides (your strings and variable/class/component/file names will surely outpace your use of keywords).
It's not an anti-feature, it's a benefit that is a huge asset in the real world. For example, you can be on a legacy ASCII system, inspect a modern UTF-8 file, and if it's in a Latin language then it will still be readable as opposed to gibberish. Yes all modern tools should be (and these days generally are) encoding-aware, but in the real world we're stuck with a lot of legacy tools too.
And of course the vast majority of transmitted digital text is in HTML and similar! What do you think it's in instead?
By sheer quantity of digital words consumed by the average person, it's news and social media delivered in browsers (HTML), followed by apps (still using HTML markup to a huge degree) and ebooks (ePub based on HTML). And of course plenty of JSON and XML wrapping too.
And of course you can use Chinese characters in JavaScript/JSON, but development teams are increasingly international and English is the de facto lingua franca.
That huge asset has become a liability. We always needed to become encoding-aware, but UTF-8's ASCII compatibility has let us delay it for decades, and caused exactly the confusion causing us to debate right now. So many engineers have been foiled by putting off learning about encodings. Joel Spolsky wrote an article, Atwood wrote an article, Python made a backwards incompatible change, etc. etc. etc.
To be honest, I'm just guessing about what text is stored in--I'll cop to it being very hard to prove. But my guess is the vast majority of text is in old binary formats, executables, log files, firmware, or in databases without markup. That's pretty much all your webpages right there.
n.b. JSON doesn't really fit the markup argument. The whole idea is that HTML is super noisy and the noise is 1 byte per character in UTF-8 versus 2 bytes in UTF-16. JSON isn't noisy, so the overhead is very low.
You can't rewrite all existing legacy software to support encodings. You just can't. A backwards-compatible format was a huge catalyst for widely supporting Unicode in the first place. What exactly are we delaying for decades? Engineers everywhere use Unicode today for new software. The battle has been won, moving forwards.
And the vast majority of text isn't in computer code or even books. It's in the seemingly endless stream of content produced by journalists and social media each and every day, dwarfing executables, firmware, etc. And if it supports any kind of formatting (bold/italics etc.) -- which most does -- then it's virtually always stored in HTML or similar (XML). I mean, what are even the alternatives? Neither RTF nor Markdown come even close in terms of adoption.
> You can't rewrite all existing legacy software to support encodings. You just can't. A backwards-compatible format was a huge catalyst for widely supporting Unicode in the first place.
Totally agree.
> What exactly are we delaying for decades?
Learning how encodings work and using that knowledge to write encoding-aware software.
> Engineers everywhere use Unicode today for new software. The battle has been won, moving forwards.
They do, but they're frequently foiled by on-disk encodings, filenames, internal string formats, network data, etc. etc. etc. All this stuff is outlined in TFA.
> And the vast majority of text isn't in computer code or even books. It's in the seemingly endless stream of content produced by journalists and social media each and every day
I concede I'm not likely to convince you here, but like, do you think Twitter is storing markup in their persistence layer? I doubt it. And even if there is some formatting, we're talking about <b> here, not huge amounts of angle brackets.
But think about any car display. That's probably not markup. Think about ATMs. Log files. Bank records. Court records. Label makers. Airport signage. Road signage. University presses.
The reason most programmers use English in their source code has nothing to do with file size (for that, there are JS minifiers) or supported encodings. It comes down to two things: English is the most used language in the industry, so if you want to cooperate with programmers from other parts of the world, English is a good idea; and it frankly looks ugly to mix languages in the same file, so when the standard library is in English, your source code will be too.
So since most source code is in English (and JS is minified anyway), UTF-8 works perfectly there too.
I think it's quite obvious that UTF-8 is the better choice over UTF-16 or UTF-32 for exchanging data (if only for the little/big-endian mess alone, and the fact that UTF-16 isn't a fixed-length encoding either).
From that perspective, keeping the data in UTF-8 for most of its lifetime, including when it's loaded into a program, and only converting "at the last minute" when talking to underlying operating system APIs makes a lot of sense, except for some very specific application types which do heavy text processing.
I'm gonna do little quotes, but I don't mean to be passive-aggressive. It's just that this stuff comes up all the time.
> I think it's quite obvious that UTF-8 is the better choice over UTF-16 or UTF-32 for exchanging data (if just for the little/big endian mess alone...
This should be the responsibility of a string library internally, and if you're saving data to disk or sending it over the network, you should be serializing to a specific format. That format can be UTF-8, or it can be whatever, depending on your application's needs.
> and that UTF-16 isn't a fixed-length encoding either)
We should stop assuming any string data is a fixed-length encoding. This is a major disadvantage of UTF-8, because it allows for this conflation.
> keeping the data in UTF-8 for most of its lifetime also when loaded into a program, and only convert "at the last minute" when talking to underlying operating system APIs makes a lot of sense, except for some very specific application types which do heavy text processing.
Well, you're essentially saying "I know about your use case better than you do". It might be important to me to not blow space on UTF-8. But if my platform/libraries have bought into "UTF-8 everywhere" and don't give me knobs to configure the encoding, I have no recourse.
And that's the entire basis for this. It's "having to mess with encodings is worse than the application-specific benefits of being able to choose an encoding". I think that's... at best an impossible claim and at worst pretty arrogant. Again here I don't mean you, but this "UTF-8 everywhere" thing.
>We should stop assuming any string data is a fixed-length encoding. This is a major disadvantage of UTF-8, because it allows for this conflation.
Mistaking a variable-width encoding for a fixed-width one is specifically a UTF-16 problem. UTF-8 is so obviously not fixed-width that such an error could not happen by mistake, because even before the widespread use of emoji, multi-byte sequences were not in any way a corner case for UTF-8 text. (For additional reference, compare UTF-16 String APIs in Java/JavaScript/etc. with UTF-8 ones in, say, Rust and Go, and see which ones allow you to easily split a string where you shouldn't be able to, or access "half-chars" as a datatype called "char".)
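Here's a tiny demonstration of the trap with std::u16string (the same thing happens with Java's 16-bit char and JavaScript string indexing, where "😀".length is 2):

```cpp
#include <iostream>
#include <string>

int main() {
    // U+1F600 (grinning face) is outside the BMP, so in UTF-16 it is a
    // surrogate pair: 0xD83D 0xDE00. A "one character" emoji has size() == 2,
    // and s[0] is half a character.
    std::u16string s = u"\U0001F600";
    std::cout << s.size() << "\n";                                   // 2, not 1
    std::cout << std::hex << static_cast<unsigned>(s[0]) << "\n";    // d83d
    std::cout << std::hex << static_cast<unsigned>(s[1]) << "\n";    // de00
}
```

Any code that naively indexes, slices, or counts by code unit will pass its ASCII-and-BMP test cases and then quietly mangle that string.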
I mean, I think we're both in the realm of [citation needed] here. I would argue that people index into strings quite a lot; whether that's because we thought UCS-2 would be enough for anybody, or because UTF-8 == ASCII and "it's probably fine", is academic. The solution is the same though: don't index into strings, and don't assume an encoding until you've validated it. That makes any "advantage" UTF-8 has disappear.
If you really think no one made this mistake with UTF-8, just read up on Python 3.
The difference is that with UTF-8 you're much more likely to trip over those bugs in random testing. With UTF-16 you're likely to pass all your test cases if you didn't think to include a non-BMP character somewhere. Then someone feeds you an emoji character and you blow up.
Yeah, ASCII is such a powerful mental model that I think anyone working with Unicode made a lot of concessions to convert people, no argument there. But I think we need to say we're done with that and move on to phase 2. Here's what I advocate:
- Encodings should be configurable. Programmers get to decide what format their strings are internally, users get to decide what encoding programs use when dealing with filenames or saving data to disk, etc. Defaults matter, and we should employ smarts, but we should never say "I know best" and remove those knobs.
- Engineers need to internalize that "strings" conceal mountains of complexity (because written language is complex), and default to using libraries to manage them. We should start viewing manual string manipulation as an anti-pattern. There isn't an encoding out there that we can all standardize on that makes this untrue, again because written language is complex.
But is it really a plurality? Portuguese, English, Spanish, Turkish, Vietnamese, French, Indonesian and German are stored more efficiently in UTF-8, while Chinese, Korean and Japanese are stored less efficiently. My gut feel is that more people use the Latin script than use CJK scripts. Indic scripts, Thai, Cyrillic, etc. are stored using two bytes in both UTF-8 AND UTF-16.
Looking at the basic multilingual plane [1], UTF-8 will use > 2 bytes to encode essentially anything that isn't:
* ASCII/Latin
* Cyrillic
* Greek
* Most of Arabic
That leaves out:
* China
* India
* Japan
* Korea
* All of Southeast Asia
Re: markup, think about any text that's in a database, stored in RAM, or stored on a disk: relatively little of it will be in noisy ASCII markup formats like HTML or XML.
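For what it's worth, the size difference above is easy to check; a quick illustration (the sample string is mine, compiled as C++17 since C++20 changes the type of u8 literals):

```cpp
#include <iostream>
#include <string>

int main() {
    // Devanagari (like most Indic, CJK and Southeast Asian scripts in the BMP)
    // takes 3 bytes per code point in UTF-8 but one 16-bit unit in UTF-16.
    std::string    u8_text  = u8"नमस्ते";   // 6 code points, UTF-8
    std::u16string u16_text = u"नमस्ते";    // same 6 code points, UTF-16

    std::cout << "UTF-8:  " << u8_text.size() << " bytes\n";                      // 18
    std::cout << "UTF-16: " << u16_text.size() * sizeof(char16_t) << " bytes\n";  // 12
}
```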