How about assume utf-8, and if someone has some binary file they'd rather a program interpret as some other format, they turn it into utf-8 using a standalone program first. Instead of burning this guess-what-bytes-they-might-like nonsense into all the software.
We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.
I doubt you can handle UTF-8 properly with that attitude.
The problem is, there is one very popular OS on which it's very hard to enforce UTF-8 everywhere: Microsoft Windows.
It's very hard to ensure that the whole software stack you depend on uses the Unicode version of the Win32 API. Actually, the native character encoding in Windows is UTF-16, so you can't just assume UTF-8. If you're writing low-level code, you have to convert UTF-8 to UTF-16 and back. Even if you don't, you have to ensure all the low-level code you depend on does the same for you.
Oh, and don't forget about Unicode normalization. There is no THE UTF-8; there are a bunch of UTF-8s with different normalization forms. Apple's macOS uses NFD while others mostly use NFC.
These are just some examples. When people living in the ASCII world casually say "I just assume UTF-8", in reality they're still assuming ASCII.
> Actually, the native character encoding in Windows is UTF-16, so you can't just assume UTF-8. If you're writing low-level code, you have to convert UTF-8 to UTF-16 and back.
Yes. You should convert your strings. Thankfully, UTF-16 is very difficult to confuse with UTF-8 because they're completely incompatible encodings. Conversion is (or should be) a relatively simple process in basically any modern language or environment. And personally, I've never run into a problem where the difference between NFC and NFD mattered. (Do you have an example?). The different forms are (or should be) visually completely identical for the user - at least on modern computers with decent unicode fonts.
The largest problem with UTF-8 (and its biggest strength) is how similar it is to ASCII. It is for this reason we should consider emoji to be a wonderful gift to software correctness everywhere. Correctly handling emoji requires that your software can handle unicode correctly - because they need multi-unit encoding with both UTF-16 and UTF-8. And emoji won't render correctly unless your software can also handle grapheme clusters.
> When people living in the ASCII world casually say "I just assume UTF-8", in reality they're still assuming ASCII.
Check! If your application deals with text, throw your favorite multi-codepoint emoji into your unit testing data. (Mine is the polar bear). Users love emoji, and your software should handle it correctly. There's no excuse! Even the windows filesystem passes this test today.
My native language uses some additional CJK chars on Plane 2, and before the ~2010s a lot of software had glitches with anything beyond the basic plane of Unicode. I am forever grateful to the "Gen Z" who pushed for emoji.
Javascript's String.length is still semantically broken though. Too bad it's part of an unchangeable spec...
There's no definition of String.length that would be the obvious right choice. It depends on the use case. So probably better to let the application provide its own implementation.
> So probably better to let the application provide its own implementation.
I’d be very happy with the standard library providing multiple “length” functions for strings. Generally I want three:
- Length in bytes of the UTF-8 encoded form. E.g. useful for HTTP's Content-Length field.
- Number of Unicode codepoints in the text. This is useful for cursor positions, CRDT work, and some other stuff.
- Number of grapheme clusters in the text when displayed.
These should all be reasonably easy to query. But they’re all different functions. They just so happen to return the same result on (most) ascii text. (I’m not sure how many grapheme clusters \0 or a bell is).
Javascript’s string.length is particularly useless because it isn’t even any of the above methods. It returns the number of bytes needed to encode the string as UTF16, divided by 2. I’ve never wanted to know that. It’s a totally useless measure. Deceptively useless, because it’s right there and it works fine so long as your strings only ever contain ascii. Last I checked, C# and Java strings have the same bug.
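To make that concrete, here's roughly what those three measures (plus .length) look like in today's JavaScript/TypeScript, using the polar bear mentioned upthread. Treat it as a sketch; the APIs involved (TextEncoder, Intl.Segmenter) are described in the replies below, and the counts in the comments are what current engines should report:

```ts
// Polar bear is U+1F43B (bear) + U+200D (ZWJ) + U+2744 (snowflake) + U+FE0F (VS16).
const polarBear = "\u{1F43B}\u200D\u2744\uFE0F";

// 1. UTF-8 byte length (what you'd put in Content-Length).
console.log(new TextEncoder().encode(polarBear).length); // 13

// 2. Unicode code points.
console.log([...polarBear].length); // 4

// 3. Grapheme clusters as displayed.
const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
console.log([...seg.segment(polarBear)].length); // 1

// ...and the built-in .length, which counts UTF-16 code units.
console.log(polarBear.length); // 5
```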
The built-in string.length method is useless (it returns the number of char objects) and I agree that's a problem, but the solution is also built into the language, unlike in JS.
JS these days also has ways to iterate over codepoints and grapheme clusters. If you treat the string as an iterator, then its elements will be single-codepoint strings, on which you can call .codePointAt(0) to get the values. (The major JS engines can allegedly elide the allocations for this.) The codepoint count can be obtained most simply with [...string].length, or more efficiently by looping over the iterator manually.
The Intl.Segmenter API [0] can similarly yield iterable objects with all the grapheme clusters of a string. Also, the TextEncoder [1] and TextDecoder [2] APIs can be used to convert strings to and from UTF-8 byte arrays.
JavaScript’s recently added implementation of String[Symbol.iterator] iterates through Unicode characters. So for example, [...str] will split any string into a list of Unicode scalar values.
Yep. I don't use eslint, but if I did I would want a lint against any use of string.length. It's almost never what you want, especially now that JavaScript supports Unicode through [...str].
String.length is fine, since it counts UTF16 (UCS2?) code units. The attribute was only accidentally useful for telling how many characters were in a string for a long time, so people think it should work that way.
> I've never run into a problem where the difference between NFC and NFD mattered. (Do you have an example?)
The main place I've seen it get annoying is searching for some text in some other text, unless you normalize the data you're searching through the same way you normalize your search string.
After reading the above comment I went looking for Unicode documentation about the different normalisation forms. One point surprised me because I hadn't ever thought of it: search should be insensitive to normalisation form, so generally you should normalise all text before running a search.
That’s a great tip - obvious in hindsight but one I’d never considered.
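A minimal sketch of why that matters, assuming a JavaScript/TypeScript environment (String.prototype.normalize is standard):

```ts
const composed = "caf\u00E9";     // "café" with a precomposed é (NFC)
const decomposed = "cafe\u0301";  // "café" as plain e + combining acute accent (NFD)
const haystack = "menu: " + composed + " au lait";

console.log(composed === decomposed);       // false, even though they render identically
console.log(haystack.includes(decomposed)); // false: the NFD search string misses the NFC text

// Normalize both sides to the same form and the mismatch disappears.
console.log(haystack.normalize("NFC").includes(decomposed.normalize("NFC"))); // true
```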
> your software should handle it correctly. There's no excuse!
It is valid for the presentation of compound emoji to fall back to their component parts. You can't expect every platform to have an up-to-date database of every novel combination. A better test is emoji with color modifiers. Another good one is grandfathered symbols with both a text and emoji presentation, forcing the chosen glyph with the variant selector prefix.
> You can't expect every platform to have an up to date database of every novel combination.
On modern desktop OSes and smart phones, I do expect my platform to have an up-to-date unicode database & font set. Certainly for something like the unicode polar bear, which was added in 2020. I'll begrudgingly look the other way for terminals, embedded systems and maybe video games... but generally it should just work everywhere.
Server code generally shouldn't interact with unicode grapheme clusters at all. I'm struggling to think of any common, valid reason to use a unicode character database in 'normal' backend server code.
> Another good one is grandfathered symbols with both a text and emoji presentation and forcing the chosen glyph with the variant selector prefix.
I didn't know about that one. I'll have to try it out.
I think being continuously updated should be tied to receiving new external content.
It's fine to have an embedded device that's never updated, but never receives new content - it doesn't matter that a system won't be able to show a new emoji because it doesn't have any content that uses that new emoji.
However, if it is expected to display new and updated content from the internet, then the system itself has to be able to get updated and actually get updated; there's no acceptable excuse for that - if it's going to pull new content, it must also pull new updates for itself.
As the user/owner of the device, no thanks. It should have code updates if and only if I ask it to, which I probably won't unless you have some compelling reason. For the device owner, pulling new updates by itself is just a built in backdoor/RCE exploit, and in practice those backdoors are often used maliciously. I'd much rather my devices have no way to update and boot from ROM.
The fact that we have to go as far back as CD players for a decent example illustrates my point - the "CD player" content distribution model is long dead, effectively nobody sells CD players or devices like CD players, effectively nobody distributes digital content on CDs or things like CDs (like, CD sales are even below vinyl sales) - almost every currently existing product receives content through a channel where updates could trivially be sent as well.
And if we're talking about how new products should be designed, then the "almost" goes away and they 100% wouldn't receive new content through anything like CDs, the issue transforms from an irrelevant niche (like CDs nowadays) to a nonexistent one.
> Another good one is grandfathered symbols with both a text and emoji presentation and forcing the chosen glyph with the variant selector prefix.
I despise that Unicode retroactively applied default emoji presentation to existing symbols, breaking old text. Who the hell thought that was a good idea?
Possibly because the major software companies made it work on phones. Thus users saw it working in many apps and complain when your app fails to do the same.
Google, Apple, IBM and MS also did a lot of localisation so their code bases deal with encoding.
It is FOSS Unix software that had the ASCII mindset, probably because C and C++ string types are byte arrays and many programmers want to treat strings as arrays. The macOS and Windows APIs do take UTF encodings as their input, not char * (agreed, earlier versions did not, but they have provided the UTF encodings for 25 years at least).
Those are ok, but both of those emoji are represented as a single unicode codepoint. Some bugs (particularly UI bugs) only show up when multiple unicode characters combine to form a single grapheme cluster. I'd recommend something fancier.
I just tried it in gnome-terminal, and while the crying emoji works fine, polar bear or a country flag causes weird issues.
Older versions of macOS did enforce NFD for file names, but more recent versions don't, at least at the OS level. But many Apple programs, such as Finder, _will_ use NFD. Except that it isn't even Unicode-standardized NFD; it is Apple's own modified version of it. And this can cause issues when, for example, you create a file in Finder, then search for it using `find` and type the name of the file the exact same way, but it can't find the file because find got an NFC form while the actual file name is in NFD.
OTOH, in many applications, you don't really care about the normalization form used. For example, if you are parsing a CSV, you probably don't need to worry about whether one of the cells uses a single code point or two code points to represent that accented e.
Thanks, yet another quantum of knowledge that makes one's life irreversibly ever so slightly worse. But not as bad as encryption (and learning all the terrible ways most applications have broken implementations in)
We make some B2B software running on Windows, integrating with customer systems. We get a lot of interesting files.
About a decade ago I wrote some utility code for reading files, where it'll try to detect a BOM first, and if there isn't one, scan for invalid UTF-8 sequences. If none are found, assume UTF-8; otherwise assume Windows-1252. Worked well for us so far.
Still get the occasional flat file in Windows-1252 with one random field containing UTF-8, so some special handling is needed for those cases. But that's rare.
Fortunately we don't have to worry about normalization for the most part. If we're parsing then any delimiters will be one of the usual suspects and the rest data.
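For the curious, here's a minimal sketch of that detection order (BOM, then strict UTF-8 decode, then fall back to Windows-1252), assuming a runtime whose TextDecoder knows the windows-1252 and utf-16 labels. Not the original utility, just the same idea:

```ts
function decodeTextFile(bytes: Uint8Array): string {
  // 1. BOM check.
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf)
    return new TextDecoder("utf-8").decode(bytes.subarray(3));
  if (bytes[0] === 0xff && bytes[1] === 0xfe)
    return new TextDecoder("utf-16le").decode(bytes.subarray(2));
  if (bytes[0] === 0xfe && bytes[1] === 0xff)
    return new TextDecoder("utf-16be").decode(bytes.subarray(2));

  // 2. No BOM: strict UTF-8 decode; fatal:true throws on any invalid sequence.
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    // 3. Not valid UTF-8: assume Windows-1252 (every byte sequence is valid there).
    return new TextDecoder("windows-1252").decode(bytes);
  }
}
```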
Microsoft Windows is a source of many a headache for me as almost every other client I write code for has to deal with data created by humans using MS Office. Ordinary users could be excused, because they are not devs but even devs don't see a difference between ASCII and UTF-8 and continue to write code today as if it was 1986 and nobody needed to support accented characters.
I got a ticket about some "folders with Chinese characters" showing up on an SMB share at work. My first thought was a Unicode issue, and sure enough, when you combine two ASCII letter bytes into one UTF-16 code unit, it will usually wind up in the CJK Unified Ideographs range of Unicode. Some crappy software had evidently bypassed the appropriate Windows APIs and just directly wrote a C-style ASCII string onto the filesystem without realizing that NTFS is UTF-16.
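To illustrate the mechanics (a sketch, not the actual software involved): pairs of ASCII letter bytes, reinterpreted as UTF-16LE code units, mostly land in the CJK ideograph blocks.

```ts
// "Report" as ASCII bytes: 52 65 70 6F 72 74. Read two bytes at a time as
// UTF-16LE code units and you get 0x6552 0x6F70 0x7472, all CJK ideographs.
const asciiBytes = new TextEncoder().encode("Report"); // UTF-8 of pure ASCII == the raw ASCII bytes
const misread = new TextDecoder("utf-16le").decode(asciiBytes);
console.log(misread); // three CJK characters where "Report" was expected
```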
Do you know of a resource that explains character encoding in greater detail? Just for my own curiosity. I am learning web development and boy, they brow beat UTF-8 upon us which okay, I'll make sure that call is in my meta data, but none bother to explain how or why we got to that point, or why it seems so splintered.
This Joel On Software article [0] is a good starting point. Incredibly it's now over 20 years old so that makes me feel ancient! But still relevant today.
The suggestion that the web should just use utf-8 everywhere is largely true today. But we still have to interact with other software that may not use utf-8 for various legacy reasons - the CSV file example in the original article is a good example. Joel's article also mentions the solution discussed in the original article, i.e. use heuristics to deduce the encoding.
Why would it break? If you just assume that the system codepage is UTF-8, then sure. If you specifically say in your manifest that you want UTF-8, then Windows (10+) will give you UTF-8 regardless of which locale it is:
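For reference, this is roughly the application-manifest setting being referred to (check Microsoft's documentation for the exact boilerplate); with it, the process's ANSI code page becomes UTF-8 on Windows 10 version 1903 and later:

```xml
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
```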
Some [1] may also consider working for any company/app that needs to display an emoji, to be a waste of at least one life (your life, and all your users' lives).
Non-technical users don't want to do that, and won't understand any of that. That's the unfortunate reality of developing software for people.
If Excel generates CSV files with some Windows-1234 encoding, then my "import data from CSV" function needs to handle that, in one way or another. A significant number of people generate CSV files from Excel and if these people upload these files into my application and I assume an incorrect encoding, they won't care that Microsoft is using obsolete or weird defaults. They will see it as a bug in my program and demand that I fix my software. Even if Excel offers them a choice in encoding, they won't understand any of that and more importantly they don't want to deal with that right now, they just want the thing to work.
> then my "import data from CSV" function needs to handle that, in one way or another.
It doesn't. Well, maybe "another".
Your function or even app doesn't need to handle it. Here's what we did on a bookkeeping app: remove all the heuristics, edge cases, broken-CSV handling and validation from all the CSV ingress points.
Add one service that did one thing: receive a CSV, and normalize it. All ingress points could now assume a clean, valid, UTF8 input.
It removed thousands of LOCs, hundreds of issues from the backlog. It made debugging easy and it greatly helped the clients even.
At some point, when we offered their import runs for download, we added our normalized versions as [original name]_clean.csv. Got praise for that. Clients loved it, as they were often well aware of their internal mess of broken CSVs.
The point is to separate the cleaning stage from the import stage. Having a clean UTF-8 CSV makes debugging import issues so much easier. And there are several well-working CSV tools, such as the Python ones, that can detect not only character encodings but also record separators and the various quotation idiosyncrasies you also need to be aware of when dealing with Microsoft Office files. Other people have already thought long and hard about that stuff so you don't have to.
Could this work? Implement handling of ancient Excel files in your SaaS product, but charge an extra dollar for parsing legacy formats and provide information on how to export correct files from Excel next time :)
Plus, Excel really likes to use semicolons instead of commas for comma-separated-files. That's another idiosyncrasy that programmers need to take into account.
I'm German and I hate this. The icing on top: Excel separates function arguments with a semicolon in German locale, too. Got me some head scratching. Examples are separated with comma, iirc.
>A significant number of people generate CSV files from Excel and if these people upload these files into my application and I assume an incorrect encoding...
I think most people would rather get in their car, drive to the ocean, board a ship and join the navy, and come back after a few years abroad - than following some instructions on how to use the computer.
Joining the navy and then be obliged to follow instructions on so many things you'd normally never even be ordered to do (because now you're in the Navy, sailor! and there are three ways to do any thing: the right way, the wrong way, and the Navy way, and guess which way we use here in the Navy?), just to not follow some instructions on how to use the computer... is something I can actually imagine some people doing.
Pretty funny, and probably some truth to that being a user sentiment.
OTOH, they have to follow some process to use the software. For just the CSV export, they already have to ensure column orders, values, formats, maybe headers, etc. Selecting UTF-8 from a dropdown seems like the easy part.
They’ll probably switch to the application that does it for them and just works, instead of the one telling them to do something they don’t really understand.
Selecting UTF-8 from a dropdown on the export doesn't seem too onerous an ask. If that's the differentiator between you and your competitors, then you might have bigger problems.
I have sat in on and observed many in-person usability sessions with various applications and websites.
I can tell you right now anyone reading this site is in a completely different universe of tech competency than the general public, and even professionals who aren’t tech-focused.
You would have lost many of them simply with the jumble of letters “UTF-8”.
I hope someday, with improved public education, we can have better users. Unlikely with Gen Z apparently having worse computer skills than the previous generation..
Yeah, I started writing software a "while" ago. I've encountered a handful of users (and charsets) in my time.
With CSV exports, there's already a good bit of training you have to do WRT the column layout, file format, headers (or not), cell value formats, etc. There's far more lift involved in training users there than ensuring they select "UTF-8" when they select CSV.
And, there's really nothing technical about it, as they don't need to understand what "UTF-8" actually means any more than they need to understand what "CSV" stands for. It's a simple ask in a list of asks.
I weigh this against having every developer now believe they have to check/convert every character set, which can be unreliable and produce garbage, BTW. And, speaking of garbage, there are some encodings that you won't be able to convert in any case, so it would be technically impossible to preserve data integrity without pushing some requirement on the source up to the user.
So, it's about tradeoffs. And, again, if asking users to choose UTF-8 is the difference between customers choosing your app or a competitor's, then you probably need to be more worried about that than your charset encoding.
Hell we just accept xlsx and xls natively. It sucks but it solves many issues. Also causes some, but you'll never get it out of users' heads that Excel files are not a data exchange format.
Definitely helps. Should be able to detect the charset from the Excel metadata, at least.
Of course, if that charset uses code points not available in your app's native charset, then you're kind of back to square one (unless your use case tolerates garbled or missing data).
> Isn't the point of the article to describe how an engineer would create such a tool?
Honestly, no, because the tool that it's suggesting how to write isn't one that will even come close to doing a good job.
If you want to write such a tool, the first thing you need to do is to understand what the correct answer is. And to do that, you need to sample your input set to figure out what the correct answer should be for several inputs where it matters. There's unfortunately no easy way to avoid that work; universal charset detection isn't really a thing that works all that well.
>universal charset detection isn't really a thing that works all that well.
This seems like something LLMs would be good at. A mundane use of them, but I bet they'd be really good at determining that the input has the wrong encoding. Then the program would iterate through encodings, from most probable to least, and select the one that the LLM likes the most. Granted, this means your tool will be 1GB or more. But hey, thems the breaks.
Yeah, that could be an interesting use of LLMs. It could at least tell you which languages might be present in the input text.
In the 1980s, we had a version of 7-bit ASCII in Sweden where the three extra Swedish vowels "åäö" were represented by "{}|".
So what might look like regular US 7-bit ASCII should be interpreted as the Swedish version if the text is in Swedish with "{}|" where "åäö" normally goes.
I'm glad that didn't stick around, like ¥/₩ being used for directory separators like \. I can't imagine trying to read source code with those substitutions.
> should be asking the user to select explicit input and output formats
It depends on the requirements.
If you're hired by a company to convert millions of old textfiles, they might want you to do it as well as possible using heuristics without any human input as a starting point.
Only with "pure" CJK text in a flat text file; for most real-world situations you'll have enough ASCII text that UTF-8 will be smaller: HMTL/XML tags, email headers, things like that. I did some tests a few years back, and wasn't really able to come up with a real-world situation where UTF-16 is smaller. I'm sure some situations exist, but by and large, CJK users are better off with UTF-8.
Yep. I'm a heavy user of CJK languages and I don't give a damn about the slightly increased plaintext storage. Give me UTF-8 any day, every day. Legacy two-byte encodings can't represent all of the historical glyphs anyway, so there's no room for nationalist crap here.
Well, it's great that you did some tests a few years back, but I'm not sure how that qualifies you to make such a sweeping generalization about CJK text encoding. It's easy to dismiss UTF-16's benefits when you're only looking at a narrow slice of the real world, ignoring the vast amounts of pure CJK literature, historical archives, and user-generated content out there.
That is not the only way. There are other ways of knowing partial contents of files and changes to files, depending on the situation. If the document is a known form in which one of five boxes is checked by the sender, it's probably not hard to rule out certain selections based on the ciphertext length, if not pin down the contents exactly.
I'm not sure i entirely understand your example (if there are 5 checkboxes and 1 checked, presumably length would be the same regardless which one of those are checked). However to your broader point, i agree there exist scenarios along those lines (e.g. fingerprinting known communication based on length), however most of them apply even better when not using compression.
The checkbox example is completely plausible. There is no guarantee that all checkboxes lead to the same number of bytes changed in the file when checked. What if the format makes a note of the page number wherever a checkbox is checked? 1X could be two bytes and 15X would be three.
And even if the format only stored the checkbox states as a single bit each (unlikely), compression algorithms don't care. They will behave differently on different byte sequences, which can easily lead to a difference in output length.
The attack you're referring to is not specific to compression. It's the same class of attack that can reveal keystrokes over older versions of ssh based on packet size and timing, even on uncompressed connections. Conversely, fixed-bitrate voice streams don't have the same vulnerability as variable-bitrate encodings even though they're still compressed.
The version of your checkbox example which is vulnerable without any formal data compression is when the checkbox is encoded in a field that is only included or changes in length if the value isn't the default, common in uncompressed variable-length encodings like JSON.
I'm sure that the people getting hacked care deeply about whether the attack they suffered was sui generis.
Also, zip/deflate etc was not designed to eliminate side channel leakage. Some compression schemes obviously (with padding) can mitigate leaks, but it has to be done deliberately
Any of it has to be done deliberately. The length of the data reveals something about its contents whether it's compressed or not.
The special concern with compression is when attacker-controlled data is compressed against secret data because then the attacker can measure the length multiple times and deduce the secret based not just on the length but on how the length changes when the secret is constant and the attacker-controlled data varies. This can be mitigated with random padding (makes the attack take many times more iterations because it now requires statistical sampling) or prevented by compressing the sensitive data and attacker-controlled data separately.
It's an extension of the chosen-plaintext attack, and so requires the attacker to be able to send custom text that they know is in the encrypted payload. If the unencrypted payload is "our-secret-data :::: some user specified text", then the attacker can eventually determine the contents of our-secret-data by observing how the size of the encrypted response changes as they change the text when the compression step matches up with a part of the secret data. It can be defeated by adding random-length padding after compression and before the encryption step, though.
Essentially if you zip something, repeated text will be deduplicated.
For example "FooFoo" will be smaller than "FooBar" since there is a repeated pattern in the first one.
The attacker can look at the file size and make guesses about how repetitive the text is if they know what the uncompressed or normal size is.
This gets more powerful if the attacker can insert some of their own plaintext.
For example if the plaintext is "Foo" and the attacker inserts "Fo" (giving "FooFo") the result will be smaller than if they inserted zq where there is no pattern. By making lots of guesses the attacker can figure out the secret part of the text a little bit at a time just by observing the size of the ciphertext after inserting different guesses.
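Here's a minimal sketch of that guessing game using Node's zlib (deflateSync). The secret and the query parameter are made up, and the exact byte counts vary, but once the match is long enough a correct guess reliably compresses noticeably smaller than an incorrect one:

```ts
import { deflateSync } from "node:zlib";

// Hypothetical secret that ends up in the same compressed stream as attacker input.
const secret = "sessionid=4f3c2a1b9d8e7f6a";

// All the attacker can observe is the length of the compressed (then encrypted) output.
function observedLength(attackerInput: string): number {
  return deflateSync(secret + "&q=" + attackerInput).length;
}

// A guess that repeats part of the secret gets deduplicated by DEFLATE, so it
// compresses smaller than a guess that doesn't; comparing lengths leaks the secret.
console.log(observedLength("sessionid=4f3c2a1b9d8e7f6a")); // correct guess: smaller
console.log(observedLength("sessionid=x9y8z7w6v5u4t3s2")); // wrong guess: larger
```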
Encrypting without zipping doesn't leak any information about the content. You can't rule out certain byte sequences (other than by total length) just by looking at the ciphertext length.
If "oui" compresses to two bytes and "non" compresses to one byte, and then you go over them with a stream cipher, which is which:
This has nothing to do with compression. If you use "yes" and "no" instead of "oui" and "non" (which just happen to be three characters each) and you compress "yes" to "T" and "no" to "F" then the uncompressed text will be the leaky one.
Yes, and my example was an example meant to prove the opposite idea. The point is that it is irrelevant whether you compress or not. You can leak information either way.
It's in the article if you would bother to read it LOL. "simply measuring the size of packets without decoding them can identify whole words and phrases with a high rate of accuracy . . . [the researchers] can search for chosen phrases within the encrypted data"
Cryptography noob here: I'm confused by "Encrypting without zipping doesn't leak any information about the content." Logically speaking, if we compress first and therefore "the content" will now refer to "the zipped content", doesn't this mean we still can't get any useful information?
Not OP, but 'zipping and encrypting' one thing (a file for example) does not leak information by itself. The problem comes when an adversary is able to see the length of your encrypted data, and then can see how that length changes over time - especially if the attacker can control part of the input fed to the compressor.
So if you compressed the string "Bob likes yams" and I could convince you to append a string to it and compress again, then I could see how much the compressed length changed.
If the string I gave you was something already in your data then the string would compress more than it would if the string I gave you was not already in your data - "Bob likes yams and potatoes" will be larger than "Bob likes yams likes Bob".
If the only thing I can see about your data is the length and how it changes under compression - and I can get you to compress that along with data that I hand to you - then eventually I can learn the secret parts of your data.
Encryption generally leaks the size of the plaintext.
This is true in both the compressed and non-compressed case. However with compression the size of the plaintext depends on the contents, so the leak of the size can matter more than when not using compression.
Even without compression this can matter sometimes. Imagine encrypting "yes" vs "no".
> Encryption generally leaks the size of the plaintext.
Ah, I see. Naïvely, this seems like a really bad thing for an encryption algorithm to do—is there no way around it? Like, why is encryption different from hashing in this regard?
There are methods, but they are generally very inefficient bandwidth wise in the general case. The general approach is to add extra text (pad) so that all messages are a fixed size (or e.g. some power of 2). The higher the fixed size is, the less information is leaked and the less efficient it is. E.g. if you pad to 64mb but need to transmit a 1mb message, that is 63mb of extra data to transmit.
Part of the problem (afaik) is we lack good math tools to analyze the trade offs of different padding size vs how much extra privacy they provide. This makes it hard to reason about how much padding is "enough".
Another approach is adding a random amount of padding. This can be defeated if you can force the victim to resend messages (which you then average out the size of).
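A tiny sketch of the fixed-bucket padding idea from a couple of paragraphs up, assuming the receiver has some separate, unambiguous way to strip the padding again (e.g. a length prefix, which is left out here):

```ts
// Pad every message up to the next power-of-two bucket so the ciphertext length
// reveals only a coarse size class, not the exact plaintext length.
function padToNextPowerOfTwo(message: Uint8Array): Uint8Array {
  let bucket = 1;
  while (bucket < message.length) bucket *= 2;
  const padded = new Uint8Array(bucket); // zero-filled padding bytes
  padded.set(message);
  return padded;
}
```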
Hashing is different because you don't have to reconstruct the message from the hash. With encryption the recipient needs to decrypt the message eventually and get the original back. However there is no way to transmit (a maximally compressed) message in less space then it takes up.
There are special cases where this doesn't apply, e.g. if you have a fixed transmission schedule where you send a specific number of bytes on an agreed-upon schedule.
Yes, of course it leaks more information than encryption without compression, because that’s just encryption which doesn’t leak anything.
In an enormous number of real-world cases, attacker-controlled input ends up being compressed alongside secret data. In that case you can guess at the secret data, and if you guess correctly, you get smaller compressed output. But even without that, imagine the worst case: a 1TB file that compresses to a handful of bytes. Pretty clearly the overwhelming majority of the text is just duplicate bytes. That’s information which is leaked.
> We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.
What ever happened to the Robustness Principle[1]? I think the entire comment section of this article has forgotten it. IMO the best software accepts many formats and "deals with it," or at least attempts to, rather than just exiting with "Hahah, Error 19923 Wrong Input Format. Try again, loser."
We collectively discovered that we were underestimating the long term costs, by a lot, so its lustre has faded. This is in some sense relatively recent, so word is still getting around, as the programming world does not move as quickly as it fancies itself to.
If you'd like to see why, read the HTML 5 parsing portion of the spec. Slowly and carefully. Try to understand what is going on and why. A skim will not reveal the issue. You will come to a much greater understanding of the problem.
Some study of what had happened when we tried to upgrade TCP (not the 4->6 transition, that's its own thing) and why the only two protocols that can practically exist on the Internet anymore are TCP and UDP may also be of interest.
The explosion of the web happened in no small part because of how easy it was to write some HTML and get a basic, working webpage out of it. If you nested some tags the wrong way and the browser just put up an error page, rather than doing a (usually) pretty good job figuring out what you actually meant, people would get frustrated faster and not bother with it at all.
But imagine if our C/C++/Java/Rust/Go/etc. compilers were like "syntax error, but ehhhhh you probably meant to put a closing brace there, so let's just pretend you did". That would be a nightmare of bugs and security issues.
The difficulty in drawing a line in the sand and sticking to the spec, though, is that of user blame. Let's say you implement a spec perfectly -- even if you are the originator of the spec -- and then someone comes along and builds something of their own that writes out files that don't conform to the spec. Your software throws up an error and says "invalid file", but the other piece of software can read it back in just fine. Users don't know or care about specifications; they just know that your software "doesn't work" for the files they have, and the other software does. If you try to tell them that the file is bad, and the other software has a bug, they really won't care.
> But imagine if our C/C++/Java/Rust/Go/etc. compilers were like "syntax error, but ehhhhh you probably meant to put a closing brace there, so let's just pretend you did". That would be a nightmare of bugs and security issues.
A possible solution to this is for large organizations to be intransigent about standards compliance. If your personal mail server rejects mail that isn't well-formed, you're just being a masochist because nobody is going to change for you. If Google does it, everybody else is going to fix their stuff because otherwise they can't send to gmail.
It proves that engineering is hard and that everything has costs and benefits, and you can't make them go away by just ignoring them. It turns out that "being robust" had a lot of costs we didn't see.
It also shows the harmfulness of binary black and white thinking in engineering. There are choices other than "just let everyone do whatever and hope all the different things picking up the pieces do it in more or less the same way" and "rigidly specify a spec and blow up the universe at the slightest deviation". Both of those easy-to-specify choices have excessive costs. There is no escape from hard design tasks. XHTML may always have been doomed to fail, but that is not to say that HTML had to be allowed to be as loosey-goosey as it is, either.
Had a gradient of failure been introduced rather than a rigid rock wall, things very likely wouldn't have gotten as badly out of hand as they did. If, for instance, a screwed up table was specified to deliberately render in a very aesthetically unappealing manner, but not crash the entire page the way XHTML did, people would have not come to depend so much on HTML being sloppy. The resulting broken page would still be somewhat usable, but there would have been motivation to fix it, rather than the world we actually live in where it all just seemed to work.
Can you give a hint as to what the issue is that one should find reading a portion of the HTML 5 spec? Or is it genuinely unexplainable without experiencing something first-hand?
>>> the best software accepts many formats and "deals with it,"
>> We collectively discovered that we were underestimating the long term costs, by a lot, so its lustre has faded.
>> If you'd like to see why, read the HTML 5 parsing portion of the spec.
> Can you give a hint as to what the issue is that one should find reading a portion of the HTML 5 spec
I think the point was that the HTML 5 spec tries to parse all kinds of weird input instead of drawing a line in the sand and forcing the input to follow a simple format?
It isn't bad. In fact it's quite good. But it is very much a case of closing the barn door after the animals got out. You see in the standard that the effort was put to corral them back in, and I'm very glad they did, but it certainly was not free.
Being lenient is all well and good when the consequences are mild. When the consequences of misinterpreting or interpreting differently to a second implementation becomes costly, such as a security exploit, then the Robustness Principle becomes less obviously a win.
It's important to understand that every implementation will try to fix-up formatting problems in their own way unique to their particular implementation. From that you get various desync or reinterpretation attacks (eg. HTTP request smuggling).
I've tried telling users "sorry, your file isn't to spec", and they say "but it works with <competitor>", and that ideology flies right out the window, along with their money.
Exactly. "Accepting and trying" is how a lot of popular software won their market. Look at HN's favorite media player, VLC. In the past, media player software were horrible, refusing to play all but the most tightly constrained set of allowed containers/codecs. I remember spending the early 2000s trying to get Windows Media Player to play MPEGs by downloading codec packs and trying to cast secret spells into the Windows Registry. Yuck! Then along comes VLC which accepts whatever you throw at it, and that software is basically everywhere now. You can throw line noise at VLC and it will try its best to play something!
The trick is to enforce conformance right from the start in the first implementation of a format. Shipping a product that doesn't interop with the existing tools is a non-starter so the devs will have to fix their shit first.
As you say, unfortunately the genie cannot be put back in the bottle for formats that already have defective implementations in the wild.
The problem is that unless you restrict everyone else, you'll end up with your own product not interoperating with *new* tools too.
E.g. you produce valid .wat files, but my software which also outputs those has some bits screwed up.
My program can read both .wat but yours can't, but I have 5% market share.
Your users complain they sometimes receive files your software can't read while the competitor can. Do you tell them "well that file is invalid, tell whoever sent it to you to change the software they use"?
The genie can't stay in the bottle unless you have some sort of certification authority and even that may not be enough (see USB)
But the more every actor aims to follow the spec rather than reproduce everyone else's bugs, the less quickly the ecosystem devolves into a horrid tangled mess that's nigh impossible for a new entrant.
"So you're saying that I can have an advantage over everyone else if I implement all the spec features, plus one extra feature that adds convenience to our users?
And you're saying that by doing this, not only do I gain an advantage over the existing competition, but I also make it more difficult for more competitors to appear?
The customer does not know. They just want it to work. They may be using something that someone else gave them. The original source system of the file may not be changeable. But most importantly, their boss just wants it to work. or else.
Yes, and that's why Postel's law is more of an empirical observation (a law of nature, if you will) on which software survives and which doesn't. You may dislike it but that won't make it go away.
I see it as the opposite, actually. They will pay you for a robust product: one that works for them. They have no care for the technical minutiae of your implementation, because they're just trying to do actually interesting things, with the help of your product. This is the customer perspective.
That is fine in contexts where a wrong guess does no harm.
But that is not always the case, and e.g. silently "fixing" text encoding issues can often corrupt the data if you get it wrong.
By all means offer options if you want, but if you do, flag very clearly to the user that they're taking a risk of corrupting the data unless any errors are very apparent and trivial to undo.
This basically loses data integrity if it's wrong though.
You might want to do that with human input if it's helpful to the user - ie user enters a phone number and you strip dashes etc. But if it's machine to machine, it should just follow the spec.
The article addresses this, that current thinking in many places is that the robustness principle / Postel's Law maybe wasn't the best idea.
If you reject malformed input, then the person who created it has to go back and fix it and try again. If you interpret malformed input the best you can (and get it right), then everyone else implementing the same thing in the future now also has to implement your heuristics and workarounds. The malformed input effectively becomes a part of the spec.
This is why HTML & CSS are the garbage dump they are today, and why different browsers still don't always display pages exactly alike. The reason HTML5 exists is because people finally just gave up and decided to standardize all the broken behavior that was floating around in the wild. Pre-HTML5, the web was an outright dumpster fire of browser compatibility issues (as opposed to the mere garbage dump we have today).
Anyway, it's not really important to try to convince you that Postel's Law is bad; what's important is that you know that many people are starting to think it's bad, and there's no longer any strong consensus that it was ever a good thing.
(A text file containing only the ASCII bytes "Bush hid the facts", when opened in older versions of Windows Notepad, displayed a sequence of CJK characters instead of the expected English sentence.)
I've lived through dealing with non-UTF8 encoding issues and it was a truly gigantic pain in the ass. I'm much more on the side now of people who only want to deal with UTF8 and fully support software that tells any other encoding to go pound sand. The harder life gets for people who use other encodings (yes, particularly Microsoft) the more incentive they have to eventually get on board and stop costing everyone time and effort managing all this nonsense.
> they turn it into utf-8 using a standalone program first
I took the article to be for people who would be writing that "standalone program"?
I have certainly been in a position where I was the person who had to deal with input text files with unknown encodings. There was no-one else to hand off the problem to.
Then you don't have to worry about it since you won't get the work in the future? Someone else, with this presumably correct software, will always be able to do it for less, faster, and at a higher quality.
That's how business works...
If such a business competitor doesn't exist, then yes charge extra, and actually do the work correctly.
Am I missing something here? The work is ingesting documents from uncontrolled sources that might not all be UTF-8 and handling them correctly. Using an encoding guessing tool is the means to do that. In practice since there are only a few widely-used encodings and they're not terribly ambiguous it means that everything just works and users happily use the software.
This isn't some theoretical thing; we do this at $dayjob right now, not only guessing the encoding but the file type as well, so that we can make sense out of whatever garbage our users upload. Everything from malformed CSV exports from Excel to PDFs that are just JPEGs of scanned documents. It works, and it works well.
And of course it does, the files our users are handing to us work on their machines. They can open them up and look at them in whatever local software they produced them with, there's no excuse for us to be unable to do the same.
The FCC ULS database records are stored in a combination of no fewer than three different encodings(1252, UTF8, and something else for a handful of German names) that vary per record.
When I brought this up they said something to the effect of: it's already Unicode, it has tilde letters!
I did once have a file that had UTF8, Windows-1252, MARC8, and VT100 (really) all mixed up in it. I think the data had gone through multiple migrations between software in its past.
I had to write my own "clean this as well as possible" thing, and it did a good enough job.
Not every encoding can make a round trip through Unicode without you writing ad hoc handling code for every single one. There's a number of reasons some of these are still in use and Unicode destroying information is one of them.
> We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.
Browsers used to have a menu option to choose the encoding you wanted to use to decode the page.
In Firefox, that's been replaced by the magic option "Repair Text Encoding". There is no justification for this.
They seem to be in the process of disabling that option too:
> Note: On most modern pages, the Repair Text Encoding menu item will be greyed-out because character encoding changes are not supported.
> Supporting the specific manually-selectable encodings caused significant complexity in the HTML parser when trying to support the feature securely (i.e. not allowing certain encodings to be overridden). With the current approach, the parser needs to know of one flag to force chardetng, which the parser has to be able to run in other situations anyway, to run.
> Elaborate UI surface for a niche feature risks the whole feature getting removed
> Telemetry [...] suggested that users aren’t that good at choosing correctly manually.
In other words, it's trying to protect users from themselves by dumbing down the browser. (Never mind that people who know what they are doing have probably also turned off telemetry...)
>> Supporting the specific manually-selectable encodings caused significant complexity in the HTML parser when trying to support the feature securely (i.e. not allowing certain encodings to be overridden).
There's no explanation of why you'd want this, or why it's security-relevant.
(Farther down, there's a mention of self-XSS, which definitely isn't relevant.)
>> Elaborate UI surface for a niche feature risks the whole feature getting removed
They've already removed the whole feature. That was easier to do after they mostly disabled it, not harder.
>> Telemetry showed users making a selection from the menu when the encoding of the page being overridden had come from a previous selection from the menu.
That would be an example of "working as expected". The removal of the ability to do this is the problem that disabling the encoding menu causes! Under the old, correct approach, you'd guess what the encoding was until you got it right. Under the new approach, the browser guesses for you, and if the first guess is wrong, screw you.
Probably because most websites now send a correct encoding header or meta tag, so the user changing it can only make things wrong. (That assumes the declared encoding is never wrong, but wrong declarations do happen in reality.)
It does happen a lot that old text content in non-UTF-8 encoding is mistakenly served explicitly marked as UTF-8. It is precisely in such circumstances that the Repair Text option is useful.
If you give me a computer timestamp without a timezone, I can and will assume it's in UTC. It might not be, but if it's not and I process it as though it is, and the sender doesn't like the results, that's on them. I'm willing to spend approximately zero effort trying to guess what nonstandard thing they're trying to send me unless they're paying me or my company a whole lot of money, in which case I'll convert it to UTC upon import and continue on from there.
Same with UTF-8. Life's too short for bothering with anything else today. I'll deal with some weird janky encoding for the right price, but the first thing I'd do is convert it to UTF-8. Damned if I'm going to complicate the innards of my code with special case code paths for non-UTF-8.
If there were some inherent issue with UTF-8 that made it significantly worse than some other encoding for a given task, I'd be sympathetic to that explanation and wouldn't be such a pain in the neck about this. For instance, if it were the case that it did a bad job of encoding Mandarin or Urdu or Xhosa or Persian, and the people who use those languages strongly preferred to use something else, I'd understand. However, I've never heard a viable explanation for not using UTF-8 other than legacy software support, and if you want to continue to use something ancient and weird, it's on you to adapt it to the rest of the world because they're definitely not going to adapt the world to you.
It depends on the domain. If you are writing calendar software, it is legitimate to have "floating time" i.e. your medication reminder is at 7pm every day, regardless of time zone, travel, or anything else.
Unfortunately Google and many other companies have decided UTC is the only way, so this sometimes causes issues with ICS files that use that format when Gmail generates its helpful popups in the inbox.
> If you are writing calendar software, it is legitimate to have "floating time" i.e. your medication reminder is at 7pm every day, regardless of time zone, travel, or anything else.
If you have to take medication (for instance, an antibiotic) every 24 hours, it must be taken at the same UTC hour, even if you took a train to a town in another timezone. Keeping the same local time even when the timezone changes would be wrong for that use case.
If you're there for a while, you'll need to adapt anyway since your biorhythms will too. But there are plenty of other cases like a reminder to check something after dinner, or my standard wake-up alarm in the morning. Or if I plan to travel, book lunch at a nice place for 1pm, and put it in my calendar I just want it to be 1pm wherever I go, without caring about TZ changes.
Calendars, alarms, and reminders have some overlap here and floating time can be good for some cases.
There are very few drugs where that's a requirement. Your kidneys and liver aren't smart enough to metabolize anything at precisely the same rate every day anyway.
> For instance, if it were the case that it did a bad job of encoding Mandarin
I don't know if you picked this example on purpose, but using UTF-8 to encode Chinese is 50% larger than the old encoding (GB2312). I remember people cared about this like twenty years ago. I don't know of anyone that still cares about this encoding inefficiency. Any compression algorithm is able to remove such encoding inefficiency while using negligible CPU to decompress.
That doesn't seem like the worst issue imaginable. I doubt there are too many cases where every byte counts, text uses a significant portion of the available space, and compression is unavailable or inefficient. If we were still cramming floppies full of text files destined for very slow computers, that'd be one thing. Web pages full of uncompressed text are still either so small that it's a moot point or so huge with JS, images, and fonts that the relative text size isn't that significant.
Which is all to say that you're right, but I can't imagine that it's more than a theoretical nuisance outside some extremely niche cases.
They shouldn't be non-existent. Zip-then-encrypt is not secure due to information leakage.
EDIT: also, it's not safe—message length is dependent on the values of the plaintext bytes, period. i'm not saying don't live dangerously, i'm just saying live dangerously knowing
The information leakage problem occurs when compression is done in the TLS layer, because then the compression context includes both headers (with cookies) and bodies (containing potentially attacker-controlled data). But if you do compression at the HTTP layer using its Transfer-Encoding then the compression context only covers the body, which is safe.
It can still leak data if attackers can get their input reflected. I.e. I send you a word, and then I get to observe a compressed and encrypted message including my word and the sensitive data. If my word matches the sensitive data, the ciphertext will be smaller. Hence I can learn things about the plaintext. That is no longer good encryption.
What you are talking about is generally referred to as the "BREACH" attack. While there may theoretically be scenarios where it is relevant, in practice it almost never is, so the industry has largely decided to ignore it (it's important to distinguish this from the CRIME attack, which is about HTTP headers instead of the response body and has a much higher likelihood of being exploitable, while still being hard).
The reason it's usually safe is that to exploit it you need:
- a secret inside the html file
- the secret has to stay constant and cannot change (since it is adaptive attack. CSRF tokens and similar things usually change on every request so cannot be attacked)
- the attacker has to have a method to inject something into the html file and repeat it for different payloads
- the attacker has to be able to see how many bytes the response is (or some other side channel)
- the attacker is not one of the ends of the communication (no point to attack yourself)
Having all these requirements met is very unlikely.
For Asian languages, UTF-8 is basically the same size as any other encoding when compressed[0] (and you should be using compression if you care about space) so in practice there is no data size advantage to using non-standard encodings.
In addition, Chinese characters encode more information than English letters, so a text written in Chinese will generally consume fewer bytes than the same text in English even when using UTF-8.
(Consider: Horse is five letters, but 馬 is one character. Even at three bytes per character, Chinese wins.)
Presumably that derives from the overhead of encoding an English character as a full byte? Given there are only 26 characters normally, you could fit that into 5 bits instead, which funnily enough roughly lines up with the Chinese character encoding (5 letters × 5 bits vs 1 character × 24 bits).
A key aspect is that nowadays we rarely encode pure text - while other encodings are more efficient for encoding pure Mandarin, nowadays a "Mandarin document" may be an HTML or JSON or XML file where less than half of the characters are from CJK codespace, and the rest come from all the formatting overhead which is in the 7-bit ASCII range, and UTF-8 works great for such combined content.
> For instance, if it were the case that it did a bad job of encoding Mandarin
Please look up the issues caused by Han unification in Unicode. It’s an important reason why the Chinese and Japanese encodings are still used in their respective territories.
I can't help myself. The grandest of nitpicks is coming your way. I'm sorry.
> If you give me a computer timestamp without a timezone, I can and will assume it's in UTC.
Do you mean, give you an _offset_? In `2024-04-29T14:03:06.0000-08:00` the `-08:00` is an offset. It only tells you what time this stamp occurred relative to UTC. It does not tell you anything about the region or zone itself. While I have consumed APIs that give me the timezone context as part of the response, none of them make it part of the timestamp itself.
The only time you should assume a timestamp is UTC is if it has the `Z` at the end (assuming ISO 8601) or is otherwise marked as UTC. Without that, you have absolutely no information about where or when the time occurred -- it is local time. And if your software assumes a local timestamp is UTC, then I argue the resulting breakage is your software's problem, not the sender's.
My desire to meet you at 4pm has no bearing on whether the DST switchover has happened, or whether my government decides to change the timezone rules, or {any other way the offset for a zone can change for future or past times}. My reminder to take my medicine at 7pm is not centered on UTC or my physical location on the planet. It's just at 7pm. Every day. If I go from New York to Paris then no, I do not want your software to tell me my medicine is actually supposed to be at midnight. It's 7pm.
But, assuming you aren't doing any future scheduling, calendar appointments, bookings, ticket sales, transportation departure, human-centric logs, or any of the other ways Local Time is incredibly useful -- ignore away.
As I mentioned in another reply, "remind me every day at 7PM" isn't a timestamp. It's a formula for how to determine when the next timestamp is going to occur. Even those examples are too narrow, because it's really closer to "remind me the next time you notice that it's after 7PM wherever I happen to be, including if that's when I cross a time zone and jump instantly from 6:30PM to 7:30PM".
Consider my statement more in the context of logs of past events. The only time you can reasonably assume a given file is in a particular non-UTC TZ is when it came from a person sitting in your same city, from data they collected manually, and you're confident that person isn't a time geek who uses UTC for everything. Otherwise there's no other sane default when lacking TZ/offset data. (I know they're not the same, but they're similar in the sense that they can let you convert timestamps from one TZ to another).
> As I mentioned in another reply, "remind me every day at 7PM" isn't a timestamp. It's a formula for how to determine when the next timestamp is going to occur. Even those examples are too narrow, because it's really closer to "remind me the next time you notice that it's after 7PM wherever I happen to be, including if that's when I cross a time zone and jump instantly from 6:30PM to 7:30PM".
That's certainly fair in the context of a recurring event with some formula. I caution that a lot of people will still immediately reach for timestamps to calculate that formula, particularly for the next occurrence, and in the context of this conversation they would be given as an ISO 8601 datetime based on Local Time. I would also caution that calendar events with a distinct moment in time at which they start are also prime candidates for Local Time, where a UTC-default mentality will cause errors.
> Consider my statement more in the context of logs of past events
From the stance of computer generated historical log data, I definitely agree that UTC everywhere is a sane default and safe to assume :)
(And, in your defense, I would definitely argue UTC-everywhere gets you 95% of the way there for 5% of the effort... I get why people make the tradeoff)
> (I know they're not the same, but they're similar in the sense that they can let you convert timestamps from one TZ to another).
More nitpicking on my part (again, I'm sorry): it lets you convert from one _offset_ to another, or from an offset to UTC. Think of Arizona, a special snowflake that (mostly!) doesn't observe DST. You can't assume all UTC-7 offsets are Mountain Time.
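To make the nitpick concrete, here's a small sketch assuming Python 3.9+'s zoneinfo module: in January the two zones share an offset, so the offset alone can't identify the zone, and by July they have diverged.

    from datetime import datetime
    from zoneinfo import ZoneInfo

    jan = datetime(2024, 1, 15, 12, 0)
    jul = datetime(2024, 7, 15, 12, 0)

    # January: Phoenix and Denver are both at UTC-7.
    print(jan.replace(tzinfo=ZoneInfo("America/Phoenix")).utcoffset())
    print(jan.replace(tzinfo=ZoneInfo("America/Denver")).utcoffset())

    # July: Denver moves to UTC-6 for DST, Phoenix (mostly) stays at UTC-7.
    print(jul.replace(tzinfo=ZoneInfo("America/Phoenix")).utcoffset())
    print(jul.replace(tzinfo=ZoneInfo("America/Denver")).utcoffset())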
It's always nice to see someone who actually understands time.
"Convert to UTC and then throw away the time zone" only works when you need to record a specific moment in time so it's crazy how often it's recommended as the universal solution. It really isn't that hard to store (datetime, zone) and now you're not throwing away information if you ever need to do date math.
Yeah, I've been trying to convince people forever to store time zones with timestamps when appropriate. If you record events from around the world and don't record what time zone they happened in you can't even answer basic questions like "what proportion happened before lunch time?"
People love simple rules and they will absolutely take things too far. Most developers learn "just use UTC!" and think that's the last thing they ever need to learn about time.
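As a sketch of why the stored zone matters (Python again, with made-up events): once you keep the zone next to the UTC instant, the "what proportion happened before lunch time?" question above becomes trivial.

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    # Hypothetical events: (UTC instant, IANA zone the event happened in).
    events = [
        (datetime(2024, 5, 1, 17, 30, tzinfo=timezone.utc), "America/New_York"),
        (datetime(2024, 5, 1, 1, 15, tzinfo=timezone.utc), "Asia/Tokyo"),
        (datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc), "Europe/Paris"),
    ]

    before_lunch = sum(
        1 for utc, zone in events
        if utc.astimezone(ZoneInfo(zone)).hour < 12   # local wall-clock hour
    )
    print(f"{before_lunch} of {len(events)} events happened before local noon")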
Developers should assume UTF-8 for text files going forward.
UTF-8 should have no BOM. It is the default. And there is no Byte Order that needs a Mark. Requiring a UTF-8 BOM just destroys the carefully planned property that ASCII-is-UTF8. Why spoil that good work?
Other Unicode encodings have BOMs, e.g. UTF-16 (where the BOM distinguishes BE from LE).
We know CJK languages need UTF-16 for compression.
The BOM is only a couple more bytes.
No problem, so far so good.
But there are old files, that are in 'platform encoding'.
Fine, let there be an OS 'locale', that has a default encoding.
That default can be overridden with another OS 'encoding' variable. And that can be overridden by an application arg. And there may be a final default for a specific application, that is only ever used with one language encoding. Then individual files can override all of the above ...
Text file (serialization) formats should define handling of an optional BOM followed by an ASCII header that declares the encoding of the body that follows. One can also imagine three-letter file extensions that imply a unique or default encoding, with values maintained by, say, IANA (like MIME types). XML got this right. JSON and CSV are woefully neglectful, almost to the point of criminal liability.
But in the absence of all of the above, the default-default-default-default-default is UTF-8.
We are talking about the future, not the past. Design for UTF-8 default, and BOMs for other Unicodes. Microsoft should have defined BOMs for Windows-Blah encodings, not for UTF-8!
When the whole world looks to the future, Microsoft will follow. Eventually. Reluctantly.
The specific use case the OP author was focusing on was CSV. (A format which has no place to signal the encoding inline). They noted that, to this day, Windows Excel will output CSV in Win-1252. (And the user doing the CSV export has probably never even heard of encodings).
If you assume UTF-8, you will have corrupted text.
I agree that I'm mad about Excel outputting Win-1252 CSV by default.
You are suggesting that if software developers willfully refuse to implement measures to detect CP-1252 in CSVs, instead insisting on assuming UTF-8 even though they know that will result in lots of corruption with data from the most popular producer of CSVs -- you are suggesting that this will put pressure on MS to make Excel output UTF-8 by default instead?
If the world worked that way, it would be a very different world than the one we've got.
If you think you can detect only CP-1252, I have news for you. It's a one-byte encoding; it can't fit much. So be prepared for a whole zoo of other one-byte encodings from that era (e.g. Cyrillic letters - welcome to CP-1251, where everything above the first 128 chars has a totally different meaning). Writing an encoding detector is not easy at all. The chance of guessing wrong is high. I'm glad most of the world (but not Excel) moved away from that can of worms.
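Which is why, in practice, the usual move is to lean on an existing detector rather than writing one. A minimal sketch with the chardet library (the filename is made up):

    # pip install chardet
    import chardet

    raw = open("legacy_export.csv", "rb").read()   # hypothetical input file
    guess = chardet.detect(raw)   # e.g. {'encoding': 'windows-1251', 'confidence': 0.87, ...}
    print(guess)
    text = raw.decode(guess["encoding"] or "utf-8", errors="replace")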
> You are suggesting that if software developers willfully refuse to implement
it will happen anyway, sooner or later
> will result in lots of corruption
there is lots of corruption anyway (e.g. the euro sign, some European letters, whole other alphabets like Cyrillic). The most dangerous cases are the subtle ones, like the German ẞ.
P.S. I don't have Office installed to check, but the online version exports non-CP-1252 chars as "?". So nice.
P.P.S. Apple Numbers offers a choice of encoding on export, with UTF-8 as the default. Google Sheets exports as UTF-8 with no choice.
OP is in fact about doing your best to detect encoding without proper tagging. It is not perfect and you won't have 100% accuracy. But it can get pretty decent (OP is literally about the techniques used to do so, statistically etc), and is necessary because of the actual world we live in. If just refusing to do it would get MS to export in UTF-8 by default that would of course have happened a long time ago!
Programming languages have lumbered slowly towards UTF-8 by default but from time to time you find an environment with a corrupted character encoding.
I worked at an AI company that was ahead of its time (actually I worked at a few of those) where the data scientists had a special talent for finding Docker images with strange configurations, so all the time I'd find that one particular container was running with a Hungarian or other wrong charset.
(And that's the problem with Docker... People give up on being in control of their environment and just throw in five different kitchen sinks and it works... Sometimes)
If csv files bring criminal liability then I am guilty.
Sidenote: this particular criminal conspiracy is open to potential new members. Please join the Committee To Keep Csv Evil: https://discord.gg/uqu4BkNP5G
Jokes aside, talking about the future is grand but the problem is that data was written in the past and we need to read it in the present. That means that you do have to detect encoding and you can't just decide that the world runs on UTF-8. Even today, mainland China does not use UTF-8 and is not, as far as I know, in the process of switching.
I understand UTF-8 is mostly fine even for East Asian languages though - and bytes are cheap.
There is no authoritative CSV specification. That does bring opprobrium. RFC 4180 is from 2005, long after Unicode and XML, so people should have known better. The absence of a real standard points to disagreement, perhaps commercial disagreement, but the IETF is supposed to be independent, is it not?
That failure to standardize encoding, and other issues (headers, etc.), has wasted an enormous amount of time for many creative hackers, who could have been producing value rather than banging their heads against a vague assembly of pseudo-spec-illusions. Me included, sadly.
You're blaming the lack of a spec for all the wasted time but that's not the cause.
The cause is that CSV is popular and is popular because it is incompletely defined. (See also: HTML, RSS.)
Making a CSV spec post-hoc solves nothing, as others here have pointed out, because there is already an "installed base" so to speak. If anything it's worse because it might mislead some people into thinking they can write to the spec and handle any CSV file.
The right move, if you want a nicely precise and strict spec, is to admit it's a new thing and give it a new name, maybe CSVS or something like that.
But good luck making it popular - there are plenty of CSV libraries out there that handle most CSV files well enough, just as there is tons of software that handles HTML and RSS well enough (which is why I am a fan of both those formats).
The only argument I'll present for giving over CSV to those who want to flail at the idea of standardizing it: the name implies too much standardization already. Why "C"SV? Most CSV processing tools accept a delimiter, right? They are just whatever-separated values; use semicolons or tabs if you want.
I'm sorry, quite right - you did. Using the computer locale.
The trouble is that people transmit data from one computer to another and so from one locale to another. And sadly, they do not always set the character encoding header correctly, if they even know what that is.
I mean take csvbase's case. It has to accept csv files from anyone. And christ preserve me, they aren't going to label them as "Win-1252" or "UTF-16" or whatever.
There is no alternative but statistical detection. And there is good evidence that this solution is fairly satisfactory, because millions of computers are using it right now! At this point csvbase uploads run into more problems with trailing commas than with character detection getting it wrong - that is your "Schelling point", I'm afraid!
HTTP actually does quite a good job of providing headers containing MIME type and encoding. There is a little work to get the default (e.g. HTML and XML are different), and decide on the case where the XML payload encoding is different to the HTTP transport encoding (e.g. perhaps XML parsers need a way to override the embedded header).
So we end up at another plausible future-directed design decision: computer-computer communication should use HTTP. I think many systems have ended up there already, perhaps prompted by the issues we have discussed.
Moral: good specs attract usage; bad, incomprehensible or inconsistently implemented specs fade away.
> China does not use UTF-8 and is not, as far as I know, in the process of switching.
That’s … not true? Most Chinese software and websites are UTF-8 by default, and it’s been that way for a while. GBK and its sisters might linger in legacy formats, but UTF-8 has certainly reached the point of mass adoption from what I can tell.
> We know CJK languages need UTF-16 for compression.
My understanding is that it is for the opposite of compression: it saves memory when uncompressed, versus UTF-8's multi-byte sequences needing more bytes. My understanding is that UTF-8 multi-byte sequences compress pretty well, as they have common patterns that form dictionary "words" just as easily as anything else. UTF-8 seems to be winning in the long run even for CJK and astral-plane languages on disk, and the operating systems and applications that preferred UTF-16 in memory are mostly only doing so out of backwards compatibility, and are themselves often using more UTF-8 buffers internally since those reflect the files at rest.
(.NET keeps UTF-16 code unit strings by default for backwards compatibility, but has more and more UTF-8-only pathways and now has some interesting compile-time options to go UTF-8-only. Python 3 settled on UTF-8 as its default text encoding, even with input from CJK communities. UTF-8 really does seem to be slowly winning everything.)
> JSON and CSV are woefully neglectful,
As the article also points out, JSON probably got it right: UTF-8 only and BOM is an error (because UTF-8) (but parsers are allowed to gently ignore that error if they wish). https://www.rfc-editor.org/rfc/rfc8259#section-8.1
That seems to be the way forward for new text-based formats that only care about backward compatibility with low-byte ASCII: UTF-8 only, no BOM. And UTF-8 (unlike UTF-16, with its surrogate reservations) could be extended further if we ever do find a reason to go past the "astral planes".
(Anyone still working in CSV by choice is maybe guilty of criminal liability though. I still think the best thing Excel could do to help murder CSV is give us a file extension to force Excel to open a JSON file, like .XLJSON. Every time I've resorted to CSV has been because "the user needs to double click the file and open in Excel". Excel has great JSON support, it just won't let you double click a file for it, which is the only problem, because no business executive wants the training on "Data > From JSON" no matter how prominent in the ribbon that tool is.)
> When the whole world looks to the future, Microsoft will follow.
That ship is turning slowly. Windows backward compatibility guarantees likely mean that Windows will always have some UTF-16, but the terminals in Windows now correctly default to UTF-8 (since Windows 10) and even .NET with its compatibility decrees is more "UTF-8 native" than ever (especially when compiling for running on Linux, which is several layers of surprise for anyone that was around in the era where Microsoft picked UCS-2 as its one true format in the first place).
You can fit Japanese comfortably in a 16 bit charset but Chinese needs more than that.
My take though is that CSV is not a good thing because the format isn't completely standardized: you just can't count on people having done the right thing with escaping, or on knowing whether a particular column is intended to be handled as strings or numeric values, etc.
Where I work we publish data files in various formats, I'm planning on making a Jupyter notebook to show people how to process our data with Pandas -- one critical point is that I'm going to use one of the commercial statistics data formats (like Stata) because I can load the data right the first time and not look back. (e.g. CSV = good because it is "open" is wrong)
If I am exporting files for Excel users I export an Excel file. Good Excel output libraries have been around for at least 20 years and even if you don't have fun with formulas, formatting and all that it is easy to make a file that people will load right the first time and every time.
> Good Excel output libraries have been around for at least 20 years
I wish that were the case more often. Depends on your ecosystem, of course.
For instance, I've yet to find a good XLSX library for JS that works well "isomorphically" (in the browser as well as Node/etc). Every one I tried either had native dependencies and couldn't run in-browser, or had a cost (time, money, size) I couldn't budget for at the time.
I have found some XLS libraries for JS that were extremely "mid", but outputting XLS is nearly as bad as CSV in 2024. (Including all the huge messy legacy of character set Encoding problems.)
The best and worst thing about CSV is that it seems "low overhead": it seems really cheap to output. ("How hard can it be, just ','.join(records)?" has been the pit so many of us fall into over and over again and sometimes never climb out of.) In terms of low overhead: in a world where all my APIs are already talking JSON, if I could wrap an existing HTTP API with just two extra headers to get "free" Excel files for my users, that could be a beautiful world:
All the pieces are already there. If you could teach every user to use "Data > From JSON" we could maybe have nice things today instead of yet another CSV export dump. We just need someone on the Excel team to greenlight a "double to click to open an .XLJSON file" feature.
First and foremost to avoid accidents: We don't want a misconfigured HTTP website accidentally opening a new Excel window for every fetch/XHR call or to have to fight Excel defaults to get JSON to open up in our IDE or Dev Tools of choice. We don't want random shell scripts accidentally curl-ing things to Excel. Things like that.
Secondly for "ownership" reasons: We don't want to give non-developers the mistaken impression that Excel "owns" JSON and that it is a Microsoft format. I've had people tell me that CSV must be a Microsoft format because you can double click them in Excel and they show an Excel-like icon (in some ways it has been too long since Lotus existed and Excel was in the "we'll take all of our competitors' file associations too" era). On the one hand it might be nice to blame all of CSV's problems on Microsoft and Excel if that were actually the case, but on the other hand it also confuses people as to the real/valid uses of the format. Unfortunately, too, that transitive relationship goes both ways and I've heard second hand that CSV files are among the reasons the Excel team hopes to never add another file type association again because supposedly they get far too many support requests for CSV file problems that maybe shouldn't be their job to deal with.
A separate file association adds some intent of "this file was actually meant to be opened in Excel and hopefully the developer actually tested it some".
What you do, rather, is drop support for non-UTF-8.
Work with tech-stacks whose text handling is based strictly around Unicode and UTF-8, and find enough opportunities that way that you don't have to care about anything else.
Let the customers who cling to data in weird encodings go to someone who makes it their niche to support that.
Joel Spolsky spoke against this exact statistics-based approach when he wrote about Unicode[1]:
> What do web browsers do if they don’t find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It’s truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn’t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it’s Korean and displays it thusly, proving, I think, the point that Postel’s Law about being “conservative in what you emit and liberal in what you accept” is quite frankly not a good engineering principle.
I don't think he was speaking against the statistics-based approach itself, just against Postel's Law in general.
Ideally people would see gibberish (or an error message) immediately if they don't provide an encoding; then they'll know something is wrong, figure it out, fix it, and never have the issue again.
But if we're in a situation where we already have lots and lots of text documents that don't have an encoding specified, and we believe it's not feasible to require everyone to fix that, then it's actually pretty amazing that we can often correctly guess the encoding.
There's the enca library (and CLI tool) which does exactly that. I used it often before UTF-8 became overwhelming. The situation was especially dire with Russian encodings. There were three one-byte encodings which were quite widespread: KOI8-R, mostly found in unixes; CP866, used in DOS; and CP1251, used in Windows. What's worse, with Windows you sometimes had to deal with both CP866 and CP1251, because it includes a DOS subsystem with a separate codepage.
Exactly. I used this technique at Mozilla in 2010 when processing Firefox add-ons, and it misidentified scripts as having the wrong encoding pretty frequently. There's far less weird encoding out there than there are false positives from statistics-based approaches.
If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding. But that should be a fallback, not the first thing you try.
> If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding.
Don't really even need to do that. There's only a handful other encodings still in common use, just try each of them as fallbacks and see which one works without errors, and you'll manage the vast majority of what's not UTF-8.
(We recently did just that for a system that handles unreliable input, I think I remember our fallback only has 3 additional encodings before it gives up and it's been working fine)
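A minimal sketch of that approach in Python; the particular fallback list here is just an example and should mirror whatever legacy encodings your inputs actually arrive in:

    def decode_lenient(data: bytes) -> str:
        # Strict UTF-8 first: it almost never false-positives on other encodings.
        for encoding in ("utf-8", "cp1252", "shift_jis", "koi8-r"):
            try:
                return data.decode(encoding)
            except UnicodeDecodeError:
                continue
        # Last resort: don't crash, but make the damage visible.
        return data.decode("utf-8", errors="replace")

(As the replies below note, the single-byte encodings in such a list rarely fail outright, so the ordering matters and a wrong-but-decodable guess is still possible.)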
The person you're replying to sort of addresses this, though not completely.
Since UTF-8 is a variable-length encoding, it somewhat naturally has some error detection built in. Fixed-length encodings don't really have that, and for some of them, any byte value, 0 to 255, in any position, is valid. (Some have a few byte values that are invalid or reserved, but the point still stands.)
So you could very easily pick a "next most common encoding" after UTF-8 fails, try it, find that it works (that is, no bytes are invalid in that encoding), but it turns out that's still not actually the correct encoding. The statistics-based approach will nearly always yield better results. Even a statistics-based approach that restricts you to a few possible encodings that you know are most likely will do better.
These are actually interpreted as the corresponding C1 control codes by Windows, so arguably not invalid in practice, just formally reserved to be reassigned to other characters in the future.
> Many charsets decode fine as UTF-8 as long as the message happens to fit in ASCII.
At which point the message is effectively ASCII. UTF-8 is a superset of ASCII, so "decoding" ASCII as UTF-8 is fine.
(Yes, I know there are some Japanese text encodings where 0x5c is decoded as "¥" instead of "\". But they're sometimes treated as backslashes even though they look like ¥ symbols so handling them "correctly" is complicated.)
"Fun" fact: some video subtitle formats (ASS specifically) use "text{\b1}bold" to format things — but since they were primarily used to subtitle Japanese anime, this frequently became "text{¥b1}bold". Which is all good and well, except when those subtitles moved to UTF-8 they kept the ¥. So now you have to support ¥ (0xC2 0xA5) as a markup/control character in those subtitles.
I'm guessing they're thinking of Extended ASCII (the 8-bit one that's actually multiple different encodings, but the lower half is shared with 7-bit ASCII and so that part does fit in UTF-8 while the upper half likely won't if the message actually uses it).
The definition GP is using most likely refers to non-ASCII sequences that validly decode as UTF-8, because virtually every major charset in practical use has ASCII as a subset.
ASCII (or ISO646) has local variants [0] that replace certain characters, like $ with £, or \ with ¥. The latter is still in use in Japan. That’s why "ASCII" is sometimes clarified as "US-ASCII".
I'm skeptical. Any charset that uses bytes 128-255 as characters is unlikely to successfully decode as UTF-8. Are there really many others that only use 0-127, or does most text just end up only using 0-127?
I think there are a bunch of encodings that just repurposed a few ASCII characters as different characters - someone on this page was giving the example of some Swedish encoding where {}| were replaced with three accented Swedish letters. There are probably a bunch of others. In those cases, the text will decode fine as UTF-8, but it will display the wrong thing.
I meant that some messages will decode fine as UTF-8, but in other messages there may be letters which don't fit in 7 bits. So some simple testing, especially with English words, will show it to work fine. But as soon as a non-7-bit character creeps in, it will stop working fine.
It's garbage anyway, which you can (non-reliably) guess by there being a Korean character in the middle of C/J kanji. (Kanji are not completely gone from Korean, but mostly.)
Based on my past role, you can't even assume UTF-8 when the file says it's UTF-8.
Clients would constantly send CSV or other files with an explicit BOM or other marking indicating UTF-8 but the parser would choke since they just output native Windows-1252 or similar into it. I think some programs just spit it out since it's standard.
Better to assume UTF-8 and fail with a clear message/warning. Sure, you can offer to guess to help the end user if it fails, but as other people have pointed out, it's been the standard for a long time now. Even Python caved and accepted it as the default: https://peps.python.org/pep-0686/
Off-topic, but the bit numbering convention is deliciously confusing.
Little-endian bytes (lowest byte is leftmost) and big-endian bits (bits contributing less numerical value are rightmost) are normal, but the bits are referenced/numbered little-endian (first bit is leftmost even though it contributes the most numerical value). When I first read the numbering convention I thought it was going to be a breath of fresh air of someone using the much more sane, but non-standard, little-endian bits with little-endian bytes, but it was actually another layered twist. Hopefully someday English can write numbers little-endian, which is objectively superior, and do away with this whole mess.
It actually would be, if we did not need to consider historical baggage.
Especially in programming, where we already use in-band prefixes like 0x to denote a hex string or 0b to denote a binary string. I like using 1{s} for little-endian encodings, e.g. 1x for a little-endian hex string and 1b for a little-endian binary string.
But even ignoring programming, it is still better in normal use. The Arabic language got it right by writing little-endian (Arabic numerals are written the same, but Arabic is a right-to-left language, so they are actually little-endian), and the European languages just stole it stupidly, copying the form instead of the function.
From what I have seen, low numbers in Arabic are spoken/written little-endian (twenty five is five and twenty). Apparently German as well. The internet claims that historically large numbers in Arabic were also written out (as in when using number words rather than numerals) little-endian.
"Although generally found in text written with the Arabic abjad ("alphabet"), numbers written with these numerals also place the most-significant digit to the left, so they read from left to right (though digits are not always said in order from most to least significant[10]). The requisite changes in reading direction are found in text that mixes left-to-right writing systems with right-to-left systems."
Default UTF-8 is better than the linked suggestion of using a heuristic, but failing catastrophically when old data is encountered is unacceptable. There must be a fallback.
(Note that the heuristic for "is this intended to be UTF-8" is pretty reliable, but most other encoding-detection heuristics are very bad quality)
You can't just assume UTF-8, but you can verify that it is almost surely encoded in UTF-8 unlike other legacy encodings. Which makes UTF-8 the first and foremost consideration.
If it's turtles all the way down and at every level you use UTF-8, it's hard to see how any input with a different encoding (for the same underlying text) would not be detected before any unintended side effects are invoked.
At this point, I don't see any sufficiently good reason to not use utf-8 exclusively in any new system. Conversions to and from other encodings would only be done at well defined boundaries when I'm calling into dependencies that require non utf-8 input for whatever reason.
Thanks for giving me an example of an architecture where the bits are labelled backwards, I'd never encountered that before. I've always appreciated that the bit number represents 2 to the power of that number.
Basically everyone uses x86 bit numbering. It has the pleasant property that the place value of every bit is always 2^n (or -2^n for a sign bit), and zero-extending a value doesn't change the numbering of its bits.
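A tiny illustration of that property, in Python: under x86-style numbering, bit n always contributes 2**n, so widening the value doesn't renumber anything.

    def bit(value: int, n: int) -> int:
        # Bit n has place value 2**n; bit 0 is the least significant bit.
        return (value >> n) & 1

    x = 0b1010_0001
    print(bit(x, 0), bit(x, 5), bit(x, 7))   # 1 1 1
    print(bit(x, 1), bit(x, 6))              # 0 0
    # Zero-extending x to 16 or 32 bits changes none of these answers.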
The more I thought it through, even assuming x86, I guess there’s just no “correct” way to casually reference bit positions when we read them in the opposite order from the machine. Are they being referenced from the perspective of a human consumer of text, or the machine’s perspective as a consumer of bits? If I were writing that content, I’d have a difficult time deciding on which to use. If I were writing for a lay person, referencing left-to-right seems obvious, but in this case where the audience is primarily developers, it becomes much less obvious.
The probability of web content not in UTF-8 is increasingly getting lower and lower.
Last I tracked, as of this month, 0.3% of surveyed web pages used Shift JIS. It has been declining steadily. I really hope people move to UTF-8. While it is important to understand how the code pages and encodings helped, I think it's a good time to actually start moving a lot of applications to use UTF-8. I am perfectly okay if people want to use UTF-16 (the OG Unicode) and its extensions as an alternative, especially for Asian applications.
Yes, historic data preservation requires a different strategy than designing stuff for the future. It is okay to however migrate to these encodings and keep giving old data and software new life.
Just the most recent episode: A statistician is using PHP, on Windows, to analyze text for character frequency. He's rather confused by the UTF-16LE encoding and thinks the character "A" is numbered 4100 because that's what is shown in a hex-editor. I tried explaining about the little-endian part, and mb-string functions in PHP. And that PHP is not a good fit for his projects.
Then I realized that this is hilarious and I won't be able to kick him from his local minimum there. Everything he could learn about encodings would first complicate his work.
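For anyone else puzzled by that one: the hex editor was simply showing the two UTF-16LE bytes of "A" in storage order. A quick interpreter check (Python here, but any language would do) makes it obvious:

    >>> "A".encode("utf-16-le").hex()   # low byte stored first
    '4100'
    >>> "A".encode("utf-16-be").hex()   # high byte stored first
    '0041'
    >>> hex(ord("A"))
    '0x41'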
The post seems to assume that only UTF-16 has Byte Order Marks, but as pointless as it sounds, UTF-8 has a BOM too (EF BB BF). It seems to be a Windows thing though; I haven't seen it in the wild anywhere else (and only rarely on Windows, since text editors typically allow saving UTF-8 files with or without a BOM. I guess it depends on the text editor which of those is the default).
That's not really a byte order mark though, it's just the UTF-8 encoding of U+FEFF, which corresponds to the byte order mark in UTF-16. Honestly, emitting that into UTF-8 was probably the result of a bug originally, caused by Windows Unicode APIs being designed for UTF-16.
Yes you're right, UTF-8 technically does as well. I've never seen them in real life either.
UTF-16 BOMs do have a useful function as I recall: they really help Excel detect your character encoding (Excel is awful at detecting character encoding).
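Sniffing for these marks is only a few lines; here's a sketch (not a full detector - note the UTF-32LE BOM starts with the same bytes as the UTF-16LE one, so the longer marks have to be checked first):

    def sniff_bom(data: bytes):
        """Return (encoding, bom_length) if a known BOM is present, else (None, 0)."""
        boms = [
            ("utf-32-le", b"\xff\xfe\x00\x00"),
            ("utf-32-be", b"\x00\x00\xfe\xff"),
            ("utf-8-sig", b"\xef\xbb\xbf"),
            ("utf-16-le", b"\xff\xfe"),
            ("utf-16-be", b"\xfe\xff"),
        ]
        for encoding, bom in boms:
            if data.startswith(bom):
                return encoding, len(bom)
        return None, 0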
Anyone got EBCDIC on their bingo cards? Because if the argument is "legacy encodings are still relevant in 2024" then we also need to bring EBCDIC (and EBCDIK and UTF-EBCDIC for more perverted fun) into the picture. Makes heuristics extra fun.
Or, you know, just say "nah, I can; that ancient stuff doesn't matter (outside of obligatory exceptions, like software archeology) anymore." If someone wants to feed me a KOI8-R or JIS X 0201 CSV heirloom, they should convert it into something modern first.
> Anyone got EBCDIC on their bingo cards? Because if the argument is "legacy encodings are still relevant in 2024"
I have a hobby interest in IBM mainframes and IBM i, so yes to EBCDIC for me. (I have encountered them professionally too, but only to a very limited extent.) In practice, I find looking for 0x40 (EBCDIC space) a useful heuristic. Even in binary files, since many mainframe data structures are fixed length space padded.
> then we also need to bring EBCDIC (and EBCDIK and UTF-EBCDIC for more perverted fun) into the picture. Makes heuristics extra fun.
Actual use of UTF-EBCDIC, while not nonexistent, has always been extremely rare. A person could spend an entire career dedicated to IBM mainframes and never encounter it.
EBCDIK, at first I wondered if that was a joke, now I realise it is a name used for non-IBM Japanese EBCDIC code pages. Again, something one can spend a whole career in mainframes and never encounter – if one never works in Japan, if one works for a vendor whose products aren't sold in Japan, probably even if you work for a vendor whose products are sold in Japan but only to IBM sites (as opposed to Fujitsu/Hitachi sites)
Actually, I can just assume UTF-8, since that's what the world standardized on. Just like I can assume the length of a meter or the weight of a gram. There is no need to have dozens of incompatible systems.
> would you guess the unit instead of specifying what you expect?
It depends on the circumstances. It might be the least bad thing to do. Or not.
But that wasn't my point. I replied to this:
> I can assume the length of a meter or the weight of a gram
Sure, the length of a meter and the "weight" of a gram are both standardized. (To be very picky, "gram" is a mass, not a weight. The actual weight depends on the "g" constant, which on average is 9.81 m/s^2 on earth, but can vary about 0.5%.)
So if you know the input is in meters, you don't need to do any further processing.
But dealing with input text files with an unknown encoding is like dealing with input lengths with an unknown unit.
So while UTF-8 itself might be standardized, it is not the same as all input text files always being in UTF-8.
You can choose to say that all input text files must be in valid UTF-8, or the program refuses to load them. Or you can use silent heuristics. Or something inbetween.
I don't understand the outraged tone. Asking developers to write actually good software shouldn't be viewed as some kind of crazy imposition. We don't have to write everything like it is running on a spacecraft (which I never claimed), but we should try to make it actually good. For example, if there was a web browser with a security compromise and the makers left it unfixed for a long time, there would be consequences. Saying "well, it's just a browser" wouldn't cut it.
More to the point, what situation can you think of where guessing measurement units is a good idea? In a CNC machine? Maps program? Somewhere else? You seem to have omitted the actual counterargument part from your counterargument, while adding a hearty dash of misplaced outrage.
You can start by assuming UTF-8, then move on to other heuristics if UTF-8 decoding fails. UTF-8 is "picky" about the sequence of bytes in multi-byte sequences; it's extraordinarily unlikely that text in any other encoding will satisfy its requirements.
(Other than pure ASCII, of course. But "decoding" ASCII text as UTF-8 is safe anyway, so that hardly matters.)
I don't think that's true. Looking at how it's encoded[0], it seems similar to many other country/language-specific encodings: bytes 0-127 are the control chars, Latin alphabet and symbols, and are more-or-less ASCII, while bytes 128-255 carry the characters specific to the language at hand.
The only way you'd successfully decode Shift-JIS as UTF-8 is if it essentially is just Latin-alphabet text (though the yen symbol would incorrectly display as a '\'). If it includes any non-trivial amount of Japanese, it'll fail to decode as UTF-8.
As for whether or not you can then (after it fails to decode as UTF-8) use statistical analysis to reliably figure out that it's in fact Shift-JIS, and not something else, I can't speak to that.
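A concrete check of the first part, using Python; 日本語 is just an arbitrary sample string:

    >>> data = "日本語".encode("shift_jis")
    >>> data.hex()
    '93fa967b8cea'
    >>> data.decode("utf-8")   # lead bytes like 0x93 are not valid UTF-8 start bytes
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte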
Do you have an example in mind? Looking at the Shift-JIS encoding tables, that seems unlikely to happen in a text of any nontrivial length; there's a small number of Shift-JIS sequences which would be valid as UTF-8, and any meaningfully long text is likely to stray outside that set.
I don't think it's fair to require "meaningfully long text" since when you're dealing with strings in programming they can often be of any arbitrary length.
Encoding detection is usually applied to a larger document, at the point it's ingested into an application. If you're applying it to short strings, something's not right -- where did those strings come from?
Taking an ID3 tag example, if you are mass-converting/sanitizing/etc. tag titles and other similar metadata, the strings are often very short, sometimes only even a single codepoint or character, and proper assumptions of encoding can not be relied on because so many people violate specs and put whatever they want in there, which is the whole point of wanting to sanitize the info in the first place.
I in fact can assume it. If the assumption is wrong then that's someone else's problem. 15 years ago I wrote a bunch of code using uchardet to detect encodings and it was a pretty useful feature at the time. In the last decade everything I've touched has required UTF-8 unless it's been interoperating with a specific legacy system which has some other fixed charset, and it's never been an issue.
There’s a difference between assuming and not making a distinction.
Very few developers I’ve met would know how to make that distinction. They’d see a few off characters and think it’s some one-off bug, but it’s because they’re both assuming an encoding.
Even if you said you’d pay them one billion dollars to fix it, they’d absolutely be unable to.
> Even if you said you’d pay them one billion dollars to fix it, they’d absolutely be unable to.
Unless you want it fixed immediately, then a million dollars should motivate almost any developer to spend a month learning, a month doing, and a few years on vacation. A billion is incomprehensible.
The way you get that reality is you do the opposite of the recommendation of Postel’s law: be very picky about what you consume and fail loudly if it’s not UTF-8
I haven't seen discussion of this point yet, but the post completely fails to provide any data to back up its assertion that charset detection heuristics work, because the feedback I've seen from people who actually work with charsets is that it largely doesn't work (especially if it's based on naive one-byte frequency analysis). Okay, sure, it works if you want to distinguish between KOI8-R and Windows-1252, but what about Windows-1252 and Windows-1257?
I've done some charset detection, although it's been a while. Heuristics kind of work for some things --- I'm a big fan of: if it's decodable as UTF-8, it's probably UTF-8, unless there are zero bytes (in most text). If there are a lot of zero bytes, maybe it's UCS-2 or UTF-16, and you can try to figure out the byte order and see if it decodes as UTF-16.
If it doesn't fit in those categories, you've got a much harder guessing game. But usually you can't actually ask the source what it is, because they probably don't know and might not understand the question or might not be contactable. Usually, you have to guess something, so you may as well take someone else's work to guess, if you don't have better information.
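That heuristic is short enough to sketch (Python; the ordering follows the comment above, and the NUL-byte threshold is an arbitrary pick, not a tuned value):

    def rough_guess(data: bytes) -> str:
        # Lots of NUL bytes strongly suggests a 16-bit encoding.
        if data.count(b"\x00") > len(data) // 4:
            for enc in ("utf-16-le", "utf-16-be"):
                try:
                    data.decode(enc)
                    return enc
                except UnicodeDecodeError:
                    continue
        # Otherwise: if it decodes as UTF-8, it's almost certainly UTF-8.
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "unknown"   # hand off to a statistical guesser from here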
Yeah. The fantastic python library ftfy ("fixes text for you", https://ftfy.readthedocs.io/en/latest/index.html), designed to fix mangled Unicode (mojibake, of many different varieties), mentions in its docs that heuristic encoding guessers are the cause of many of the problems ftfy is designed to fix. It's magical, by the way.
That section explains why not to use a specific naive charset detection library that doesn't have a strong prior for UTF-8. There's no basis for extrapolating that further.
I require UTF-8. If it isn't currently UTF-8, it's someone else's problem to transform it to UTF-8 first. If they haven't, and I get non-UTF-8 input, I'm fine bailing on that with a "malformed input - please correct" error.
That works until you can't pay your bills unless you take a new contract where you have to deal with a large amount of historical text files from various sources.
Even then, at that point I'm writing an adapter to convert into UTF-8 JIT in front of my service. Or modifying my program when that contract comes up, but I'm not going to waste time proactively safeguarding against a what-if that may come up at some point.
> I'm not going to waste time proactively safeguarding against a what-if
The article doesn't say that you should. It clearly states that for many cases, the input format is known or explicitly stated in the input headers.
The article talks about cases where the input files are in an unknown input format. Even then, it states: "Perhaps there is a case to be made that csvbase's auto-detection should be a more explicit user interface action than a pre-selected combo-box."
But for the case where the requirements call for heuristics, the article then talks about how that can be done.
> at that point I'm writing an adapter to convert into UTF-8 JIT in front of my service.
And at that point you might need the advice in the article, right?
> at that point I'm writing an adapter to convert into UTF-8 JIT in front of my service
Right, and at that point you're probably gonna need these statistics-based heuristics to write that adapter. Unless you know specifically what other encoding each bit of input is. If you do, then, again, you are not the target audience for this article.
If the Unicode consortium haven't been able to come up with a way of encoding their name correctly, I don't see what hope I have of doing so.
Bonus - as soon as the Unicode consortium do find a way, my software should be able to handle it with no further changes. Well, it might need a recompile against a newer `libicu` as I don't think they maintain ABI backcompat between versions. But there's not much I can do about that.
Unicode can't do it without a breaking change. But if you support non-unicode encodings in the traditional, documented and standard way then your application will handle it fine. Who knows, one day a successor to unicode may come out that can handle all languages properly within a single encoding, in which case an application written to be encoding-aware will support it without even a recompile.
What a bad, hyperbolic take. UTF-8 can encode the entire Unicode space. All you need is up-to-date libraries and fonts to display the codepoints correctly. It is backwards compatible forever. So requiring UTF-8 allows Japanese to represent their writing method exactly how it is and keep the scheme for a very long time with room to improve.
Japanese text in UTF-8 is frequently rendered with the Chinese version of the kanji due to Han unification, not "representing their writing method exactly". The Shift-JIS encoding communicates that the text is in Japanese via the encoding itself, facilitating selection of the correct font.
And it indeed facilitates that: in practice it works better than encoding in UTF-8, which lacks an in-band way to communicate the language, and out-of-band signaling often fails, is ignored, or doesn't exist.
I get that you're referring to Han Unification, but if software doesn't display unified glyphs with the correct style, that's an issue with the font rendering system, not Unicode. Sure, the font rendering system's job may have been easier had Unicode made different choices, but encoding-wise, Unicode is no more ambiguous than any other encoding it's round trip compatible with. The font rendering system is free to assume all unified glyphs should be rendered in a Japanese style, just like it would have with a Japanese-centric encoding.
> Switching your entire encoding system to set a different font is by far the stupidest way to do it.
If it's stupid and it works, it's not stupid. I wish there were other reliable ways to have international programs display Japanese correctly, but there aren't.
To be clear, we're talking about a program/library that can handle both unicode and shift-JIS, and it will render a character that unicode considers identical in different ways depending on what encoding you loaded the character from, right?
Firefox (not the best example for a number of reasons, if you're going to follow up then I'll talk about a different one, but if you really want just one then it's the program I have to hand right now).
> To be clear, we're talking about a program/library that can handle both unicode and shift-JIS, and it will render a character that unicode considers identical in different ways depending on what encoding you loaded the character from, right?
Though if you're making any attempt to use valid tags you'll have <html lang="ja">, and that solves the problem for the web context at least as far as my testing goes.
> Though if you're making any attempt to use valid tags you'll have <html lang="ja">
Right, which is why that's a bad example. But it's the same for most "normal" applications (even e.g. text editors, or I remember hitting this kind of thing in WinRAR), and a lot of the time there isn't a standard way of indicating the language/locale in the file. Even within firefox, there are (admittedly rare) cases where you're viewing something that isn't HTML and doesn't have HTTP headers so using the encoding or manually setting the page language is still the only way to make it work - and applications that have a manual "file language" setting are the exception rather than the rule.
My understanding is Unicode (and therefore UTF-8) can encode all the codepoints encodable by Shift JIS. I know that you need a language context to properly display the codepoints that have been Han Unified, so that could lead to display problems. But if we're trying to properly display a Japanese name, it's probably easier to put the appropriate language context in a UTF-8 document than it is to embed Shift JIS text into a UTF-8 document.
Realistically --- if someone hands me well marked Shift JIS content, I'm just going to reencode it as UTF-8 anyway... And if they hand me unmarked Shift JIS content, I'll try to see if I can decode it as UTF-8 and throw it away as invalid if not.
> My understanding is Unicode (and therefore UTF-8) can encode all the codepoints encodable by Shift JIS
Trivia: There are some variants of ShiftJIS where this isn't entirely true. The traditional yuugen gaisha symbol, for example, which is analogous to U+32CF (LTD), is not supported. The /VENDORS/APPLE/JAPANESE.TXT file uses a PUA designator and then a sequence of four Unicode code points to convert it.
> if we're trying to properly display a Japanese name, it's probably easier to put the appropriate language context in a UTF-8 document than it is to embed Shift JIS text into a UTF-8 document.
You'd think that, but in practice I've found the opposite. Applications that use encodings managed to display things properly. Applications that hardcode UTF-8 don't.
I'm no encoding geek, but I'd say I have more than a passing familiarity with the issues involved. But I didn't know UTF-8 had in-band language signaling until today, so it perhaps doesn't surprise me that many applications don't implement it. (UI toolkits should, though... there's kinda no excuse for that.)
If you implement any kind of encoding support (that is, any kind of support for non-ASCII/non-unicode) you will probably have working Shift-JIS support even if you never test it, because Shift-JIS works the same as every other encoding you might test with. If you tested French or Spanish or really anything that wasn't English, you will display Japanese fine.
If you implement only unicode then you put yourself in a situation where Japanese is uniquely different from every other language, and your program will not work properly for Japanese unless you tested Japanese specifically.
This wouldn't solve your original problem, since UTF-8 is also popular with Japanese users, so just adding Shift-JIS isn't enough. It comes down to the same basic thing: to support Japanese you have to do some extra work to get and use extra info about the language, which is also possible within UTF-8, counter to your initial broad claim of the opposite.
> This wouldn't solve your original problem since UTF8 also popular with Japanese
Only among people who don't care. Implementing encoding support means your app supports a way to display Japanese properly. If you want to add more ways to display Japanese properly, go ahead, but that's supererogatory, whereas UTF8-only apps don't have a way to display Japanese properly at all.
> to support Japanese, you have to do some extra work to get and use extra info about a language, which is also possible within UTF8
It's not. There's some imaginary theorycrafted way in which it might notionally be possible within UTF-8, but not one UTF-8-only app has ever actually implemented support for displaying Japanese properly. The only approach to displaying Japanese properly that has ever actually been implemented in reality is to support multiple encodings (or to support only a Japanese encoding, but that has obvious downsides), and if you make your app encoding-aware then that's enough, you don't have to do anything else to be able to display Japanese properly for people who care about displaying Japanese properly (it's always possible for an app maker to go above and beyond, but I'm sure you'll agree the big difference is between an app that has a way to display proper Japanese at all and one that does not).
Are there Japanese characters missing in UTF-8? They should be added ASAP.
I know there's a weird Chinese/Japanese encoding problem where characters that kind-of look alike have the same character id, and the font file is responsible for disambiguation (terrible for multi-language content and we should really add more characters to create versions for each, but still the best we have).
IMHO the Unicode Consortium should standardize on using a variant selector for switching between Chinese and Japanese variants of unified Han characters. Best of both worlds: language-independent specificity while keeping the ability to have both in the same document.
Many Japanese websites have migrated from Shift-JIS to UTF-8, but this still ignores the fact that e.g. television captioning uses special characters[1] that are not found in UTF-8 or Shift-JIS. Windows itself has a habit of using its own Windows-932 encoding, which frequently causes problems in the Unix software I use. (e.g. Emacs fails at auto-detecting this format, and instructing Emacs to use Shift-JIS will result in decoding issues)
Windows-932 is the text encoding I dread the most. I wish literally everything could use Unicode, but we're not quite there yet. There's a reason encoding-japanese[1] on NPM has nearly 1 million weekly downloads. Unfortunately, I have to use it in one of my React Native applications, since one of the servers I speak to returns text encoded as Shift-JIS and Hermes does not implement JavaScript's TextDecoder API. Shift-JIS is marginally better than the Windows code page, but I'd really prefer UTF-8.
More or less. The proportion of Japanese websites that use Shift-JIS is actually increasing, for example. (It's true that the absolute number of Japanese websites using UTF-8 is increasing, but that's misleading - it's only due to the overall growth of the web).
The parent said "displayed correctly", not "encoded".
For example, if I want to talk about the fairly rare Japanese name '刃一' (Jinichi), there's a chance your computer displays it correctly, but there's also a chance your computer displays the Chinese variant of the first character, making it look wrong. It's basically down to your computer's font choice.
The "correct" way to fix that would be for me to be able to tag it with 'lang=ja', but Hacker News doesn't let me include html tags or some other 'language selector' in my comment, so I'm unable to indicate whether that's supposed to be the chinese or japanese variant of the character.
Most unicode text files don't have extra metadata indicating if a certain bit of text is japanese or chinese, so displaying it correctly by adding the correct 'lang' tag is impossible, especially since it's perfectly possible for one utf-8 text to mix both chinese and japanese.
I didn't propose any solution or claim Shift-JIS fixes this. It doesn't really since a single Shift-JIS document can only encode the japanese variant, not both.
However, a unicode codepoint which acted as a "language hint" switch would solve this, and wouldn't require doubling the number of han codepoints.
There already are unicode variant selectors, but iiuc they only apply to a single character at a time, and no one actually uses them, so they're not very useful.
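For the curious, this is what using one of those variation selectors looks like in practice. A hedged illustration: 葛 (U+845B) is a commonly cited character with registered variants, but whether the forms actually render differently depends on the font and on which Ideographic Variation Database registrations it supports.

    base = "\u845b"        # 葛
    vs17 = "\U000e0100"    # VARIATION SELECTOR-17, first of the IVD range
    vs18 = "\U000e0101"    # VARIATION SELECTOR-18

    for s in (base, base + vs17, base + vs18):
        print(s, [f"U+{ord(c):04X}" for c in s])
    # Codepoint-wise these are three different strings, which is part of why
    # naive search implementations struggle with them.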
The Shift-JIS codepoints for the characters of that name are understood to refer to Japanese characters, so fonts render them correctly. Encoding-aware programs have different representations (such as GB 18030 codepoints) for the similar-but-different Chinese characters that unicode-only programs tend to display these characters as, and so will render them differently.
If I understand you correctly, it means that the encoding itself serves as the metadata that indicates Chinese/Japanese. In which case, why is it unreasonable to ask for the same for UTF-8, except using some more clearly specified way to indicate this (like lang="ja" etc), rather than encoding it all into separate characters?
> If I understand you correctly, it means that the encoding itself serves as the metadata that indicates Chinese/Japanese.
Only if you have some unicodebrained mentality where you consider a Chinese character that looks sort of similar to a Japanese character to be "the same". If you think of them as two different characters then they're just different characters, which may or may not be present in particular encodings (which is completely normal if you make a program that handles multiple encodings: not every character exists in every encoding)
> In which case, why is it unreasonable to ask for the same for UTF-8, except using some more clearly specified way to indicate this (like lang="ja" etc), rather than encoding it all into separate characters?
Firstly, you have to "draw the rest of the fucking owl" and actually implement that language selection mechanism. Secondly, if you implement some clever extension mechanism on top of UTF-8 that's only needed for Japanese and can only really be tested by people who read Chinese or Japanese, realistically even if you implement it perfectly, most app makers won't use it or will use it wrong. Whereas if you implement encoding-awareness in the standard way that we've been doing for decades and test with even one non-default encoding, your application will most likely work fine for Japanese, Chinese and every other language even if you never test it with Japanese.
Any of the well-known han unification examples. People claim this can be solved with some kind of out-of-band font selection vaporware, but the kind of person who thinks it's fine to demand UTF-8 everywhere never actually stoops to implement this imaginary font selection functionality.
Well, I must nitpick: Shift JIS is actually just one of those out-of-band vaporware methods. Sure, it's better supported because of legacy, but new code that doesn't care about handling some lang metadata is not going to care about Shift JIS either.
Of course, there is no correct solution (people in other comments are a bit too quick to believe one exists). A dynamic HTML page can load pieces of text written by people from all across the world, so it can't rely on any “main” language. Those people can't be automatically classified (based on some browser settings or location). Then there are always people who know both Chinese and Japanese (while using some other system locale, to make things more complex). It is wrong to assume that they should not be able to use Chinese forms and Japanese forms at once, even inside a single paragraph or phrase.
I wonder why Unicode has not simply introduced some “combining language form character” to make the user's choice stick. After all, there's a whole subsystem for emoji modification, and those things were once “weird Japanese texting customs”. As for the complexity of handling Unicode text, it asymptotically approaches its maximum anyway.
Wait a second, there are “variation selectors” and some “Ideographic Variation Database”. Is that the solution? Can IMEs and converters simply stamp each variable character with an invisible mark based on the current input language? I suppose there's some catch…
> Shift JIS is actually just one of those out-of-band vaporware methods.
It's the opposite of vaporware; there's a whole bunch of well-known software that handles encodings correctly.
> new code that doesn't care about handling some lang metadata is not going to care about Shift JIS either.
If you support encodings the way almost every programming language tells you to, you'll handle Shift-JIS just fine. If you use the "legacy" encodings approach and test even one non-English language, you'll handle Japanese fine.
> Can IMEs and converters simply stamp each variable character with invisible mark based on current input language? I suppose there's some catch…
They're officially deprecated, IMEs and converters don't use them, naïve search implementations (which is to say most search implementations, because people do not test for an obscure edge case) break, ....
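To illustrate the search problem (a toy example of my own, not from any real app): the selector sits between the base characters, so a plain substring match on what the user typed no longer finds the text.

    # The document stores 葛 followed by an ideographic variation selector
    # (U+E0100); the user searches for the plain text without the selector.
    document = "東京都葛\U000e0100飾区"
    query = "葛飾区"

    print(query in document)   # False: the naive match fails
    # A selector-aware search would have to strip U+FE00..U+FE0F and
    # U+E0100..U+E01EF from both strings before comparing.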
But there is no “naive” implementation of Unicode. It won't handle emoji, normalization, and a thousand other things. People use The Library anyway. Adding another mutagen to it should be no different from the rest.
As for official stance, maybe it's time for some group to agree on certain non-conflicting sequences, and implement them in some library/stack. A particularly evil solution would choose arbitrary existing combining characters that can't be declared “illegal” retroactively.
By the way, is it possible to make proper quotes from other languages as described above when using Shift JIS?
> But there is no “naive” implementation of Unicode.
I assure you there are thousands if not millions of apps that implement search the naive way, by comparing bytes.
> It won't handle emoji, normalization, and thousands other things.
Yeah. But it works well enough for Americans, so the maintainers don't care enough to fix it.
> As for official stance, maybe it's time for some group to agree on certain non-conflicting sequences, and implement them in some library/stack.
And then what? What do you do with documents in Japanese that have already been converted without using your new codepoints? What do you do with programs that ignore your new standard? What do you do about search being broken in most apps with no prospect of fixing it? There are already some PUA codepoints for a proper implementation of Japanese, but most app authors don't even understand the problem, never mind being willing to support something that's "non-standard". Asking them to support traditional non-unicode encodings, which is something that's at least relatively well-known and standardised, and something they can test without knowing Japanese, is much easier.
> By the way, is it possible to make proper quotes from other languages as described above when using Shift JIS?
No, not for arbitrary languages. If you want to mix arbitrary languages, you need some structure that's a rope of strings with an encoding per segment. But that's exactly the same thing that the unicode advocates claim is easy ("just have a sequence of spans with different lang tags"), and at least if you use traditional encodings then you don't have to do any such fancy business for the much more common case of a file that's entirely in Japanese (perhaps with some ASCII keywords).
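For what it's worth, the rope-of-segments structure I mean could look something like this sketch (all names and types are made up for illustration):

    from dataclasses import dataclass

    @dataclass
    class Segment:
        encoding: str   # e.g. "shift_jis", "gb18030", "utf-8"
        data: bytes     # raw bytes in that encoding

    @dataclass
    class MixedText:
        segments: list

        def to_display(self) -> str:
            # Each span decodes with its own encoding; the encoding doubles
            # as the Japanese-vs-Chinese hint being argued about above.
            return "".join(s.data.decode(s.encoding) for s in self.segments)

    doc = MixedText([
        Segment("shift_jis", "日本語の部分。".encode("shift_jis")),
        Segment("gb18030", "中文部分。".encode("gb18030")),
    ])
    print(doc.to_display())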
> implement this imaginary font selection functionality
We are already doing GSUB/GDEF tables and a lot of horrible stuff to display pretty glyphs on the screen. Hell, there are literal VM instructions in TTF files to help with pretty rasterization on low-res screens.
Making a font rendering library is hard. Nintendo hard.
That's just the nature of the beast. Fonts are messy, and if we want one standard to deal with them once and for all, some compromises must be made. CJK is not some obscure alphabet. In this case, perfect is the enemy of good.
I am just glad there really is only one (non-niche) standard, obligatory https://xkcd.com/927/
Right. One would think that misrendering a major world language used by over 100 million people would be an issue that warrants some attention. But too many of the HN crowd don't care.
> In this case, perfect is enemy of good.
It's not though. Unicode-only is not just imperfect, it's an outright regression for Japanese. Meanwhile traditional encoding-aware programs render Japanese just fine.
> It's not though. Unicode-only is not just imperfect, it's an outright regression for Japanese.
Unicode has variant selectors which can deal with regional variants (e.g. https://imgur.com/a/syMcWNO), so there is a way to address the issue.
Granted, it's not very widely used, but Unicode provided a solution. The onus is now on application developers and font providers.
Choosing to limit characters to 2 bytes was a reasonable technical choice back in 1992; non-unified CJK wouldn't have fit (that was the era of 2-4 MB of RAM and 100-200 MB hard drives). A solution to the problem was provided later.
Using Shift-JIS is like running a custom protocol directly on top of IP instead of using UDP.
Sure, you can do it, but it's a dead-end choice. Time is better invested in improving the standard, which is Unicode (e.g. conversion software).
> Unicode has variant selectors which can deal with region variants (e.g. https://imgur.com/a/syMcWNO) to deal with the issue.
They're officially deprecated and cause issues like breaking search, etc. So they're still below feature parity with using a traditional encoding. Traditional encoding support is also easier to test, since documents with traditional encodings are more widespread and exist for many world languages, not just Japanese.
Unless you're using an OS older than Windows 2000, or a Linux distro from the 2000s where some form of Unicode was not the default encoding, or maybe an ancient Win32 program compiled without "UNICODE" defined, it shouldn't be a problem. I specifically work with a lot of Japanese software and have not seen this problem in many years.
And even back in the mid-2000s, the only real problems I saw were things like malformed HTML pages that assumed a specific encoding they wouldn't tell you about, or MP3 files with CP932 shoved into ID3 tags against the (v1) spec.
I also disagree with the author that Shift-JIS can be heuristically detected "well enough": its trail bytes can fall in the 7-bit ASCII range as well as the 8-bit range, so the same byte value means different things depending on which character is actually intended. Even string searching requires a complex custom-made version just for Shift-JIS handling (see the sketch below).
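Here's a small sketch of that search problem (my own example): in Shift-JIS the second byte of a double-byte character can fall in the ASCII range, so byte-level search misfires.

    text = "ソース"              # "source" in katakana
    sjis = text.encode("shift_jis")
    print(sjis.hex(" "))         # 83 5c 81 5b 83 58

    # Byte-level search "finds" an ASCII backslash (0x5C) even though the
    # text contains none: it's the trail byte of ソ (0x83 0x5C).
    print(b"\\" in sjis)         # True, a false positive
    print("\\" in text)          # False, the codepoint-aware answer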
The article's pretty weird for presenting little-endian UTF-16 as normal and barely even mentioning that big-endian is an option (in fact, seems to refer to it as "backwards"), even though big-endian is a much more human readable format.
Java and Javascript both use UTF-16 for internal string representation, even though JSON specifies UTF-8. Windows APIs too. I'm still not sure why, but it means that one char uses at least 2 bytes even if it's in the ASCII range.
Early adopters of Unicode used the first available encoding, UCS-2. UTF-16 is an extension of that to handle the increased range of code points that came later.
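A quick way to see that extension in action (my own example): BMP characters still take a single 16-bit unit, exactly as in UCS-2, while anything above U+FFFF becomes a surrogate pair.

    bmp = "A"        # U+0041: one 16-bit unit, same as UCS-2
    astral = "😀"    # U+1F600: needs a surrogate pair in UTF-16
    print(bmp.encode("utf-16-be").hex())      # 0041
    print(astral.encode("utf-16-be").hex())   # d83dde00 (high + low surrogate)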
Why would anyone use anything other than UTF-8 in this day and age?
Windows took a gamble years ago when the winner was not obvious; we're stuck with UCS-2, but you can work around that with a good string library like Qt's QString.
However, you can check for invalid UTF-8 sequences, throw an error like "invalid encoding at byte x, please use valid UTF-8" if one is encountered, and from that point on assume UTF-8.
But you can assume non-UTF-8 upon seeing an invalid UTF-8 byte sequence. From there, it can be application-specific depending on what encoding you expect to see.
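In Python terms, the reject-and-report approach from the two comments above might look like this sketch (the file name is hypothetical):

    data = open("input.txt", "rb").read()
    try:
        text = data.decode("utf-8")   # strict by default
    except UnicodeDecodeError as err:
        raise SystemExit(
            f"invalid encoding at byte {err.start}, please use valid UTF-8")
    # From here on, `text` is known-good Unicode decoded from UTF-8.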
Dear lazyweb: I think I read something about Postel's Law being essential to the internet's success -- maybe this was also IPv6 related? Does anyone else remember this?
I maintain a system I created in 2004 (crazy, right?). Not sure how we lived, but at the time emojis were not as much of a thing. This has come to bite me several times.
I'm actually having this issue where users import CSV files that don't seem to be valid. DuckDB throws errors like: "Invalid Closing Quote: found non trimable byte after quote at line 34", "Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in value construction", "Value with unterminated quote found".
One example: pgAdmin can export a database table into a CSV... but the CSV isn't valid for DuckDB to consume. Because, for some odd reason, pgAdmin uses a single quote to escape a double quote.
This blog is pretty timely. Thank you for writing it!
Pretty hard. There's no true way of knowing when you're right. You can make educated guesses based on statistical likelihood of certain patterns, but nothing stops people from constructing text that happens to look more "normal" when interpreted as another character set.
If I see the bytes 0xE5 0xD1 0xCD 0xC8 0xC7, then it is a decent bet that it's intended to be مرحبا in DOS 708. But there's no definitive reason why it couldn't be σ╤═╪╫ (DOS 437) or åÑÍÈÇ (Windows-1252) or Еямих (KOI8-RU). Especially if you don't know for sure that the data is intended to be natural language.
I can at least rule some things out. For instance, it isn't valid UTF-8: 0xE5 would start a three-byte sequence, so the next two bytes would have to be continuation bytes in the range 0x80-0xBF, and 0xD1 isn't one. (It happens to be structurally valid Shift-JIS - a double-byte character followed by three half-width katakana - but that reading is gibberish no real Japanese text would contain.) So you can narrow down the possibilities... but you're still making an educated guess. And you have to figure out a way to program that fuzzy, judgment-y analysis into software.
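You can watch that ambiguity directly with codecs Python happens to ship (ISO 8859-6 stands in for the DOS Arabic code page, so the exact glyphs differ slightly from the ones above):

    blob = bytes([0xE5, 0xD1, 0xCD, 0xC8, 0xC7])
    for codec in ("iso8859-6", "cp437", "cp1252", "koi8-r"):
        print(codec, blob.decode(codec))
    # Every one of these decodes without error, so validity alone can't
    # tell you which interpretation the author intended.
    try:
        blob.decode("utf-8")          # UTF-8 at least rejects it outright
    except UnicodeDecodeError:
        print("not valid UTF-8")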
I've written code to do this. If you're lucky there will be a BOM (byte order mark) or MIME type to indicate the encoding form. In this case you know the encoding. If you don't have this information, then you must guess the encoding. The issue is guessing may not produce accurate results, especially if you don't have enough bytes to go by.
The program I wrote to guess the encoding would scan the bytes in multiple passes. Each pass would check if the bytes encoded valid characters in some specific encoding form. After a pass completed I would assign it a score based on how many successfully encoded characters were (or were not) found. After all passes completed I'd pick the highest score and assume that was the encoding. This approach ended up being reasonably reliable assuming there were enough bytes to go by.
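A deliberately crude approximation of that scoring idea (my own sketch, not the commenter's actual code):

    CANDIDATES = ["utf-8", "shift_jis", "cp1252", "koi8-r"]

    def guess_encoding(data: bytes) -> str:
        best, best_score = "utf-8", float("-inf")
        for enc in CANDIDATES:
            text = data.decode(enc, errors="replace")
            # One pass: reward printable characters, penalize bytes that
            # failed to decode (U+FFFD) and stray control characters.
            score = sum(
                1 if ch.isprintable()
                else -5 if ch == "\ufffd"
                else 0 if ch in "\r\n\t"
                else -1
                for ch in text
            )
            if score > best_score:
                best, best_score = enc, score
        return best

    # A real implementation would also weight letter frequencies per language;
    # with a scorer this crude, single-byte codecs (which accept any byte)
    # often tie or win, which is exactly why detection remains a guess.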
But I will, because in this day and age, I should be perfectly able to do so.
Non-use of UTF-8 should simply be considered a bug, and not treating text as UTF-8 should frankly be a you problem. At least for anything reasonably modern, by which I mean anything made in the last 15 years or so.
Do we really need 128 permutations just to express an alphabet of 26 letters?
I think we should use a 4 bit encoding.
0 - NUL
1-7 - aeiouwy
8 - space
9-12 - rst
13-15 - modifiers
When modifier bits are set, the values of the next half-byte change to represent the rest of the alphabet, numbers, symbols, etc. depending on the bits set.
As someone who has repeatedly had to deal with Unicode nonsense; I wholeheartedly agree. Also, you don’t need accents. You just need to know how to read and have context. See: live and live, read and read, etc.
I suspect the author doesn't know about the "first bit is 1" thing (every byte of a multi-byte UTF-8 sequence has its high bit set, so plain ASCII bytes never appear inside one).
utf8 is magic.
You can assume US-ASCII for lots of very useful text protocols, like HTTP and STOMP, and not care what the variable string bytes mean.
Soooo many software architects don't grok the magic of it.
You can define an 8-bit parser that checks for "a:"(some string)\n and it will work with a shitload of human languages (see the sketch below).
The company I work for does not realise that most of the 50-year-old legacy C it has is fine with UTF-8 for all the arbitrary fixed-length or \0-terminated strings it stores.
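A tiny sketch of the kind of parser I mean (my own illustration, with a made-up STOMP-ish frame): it only ever looks at ':' and '\n' as bytes, so the value bytes can be any UTF-8 text.

    def parse_headers(raw: bytes) -> dict:
        headers = {}
        for line in raw.split(b"\n"):
            if not line:
                break                      # blank line ends the header block
            key, _, value = line.partition(b":")
            headers[key.strip()] = value.strip()
        return headers

    frame = "destination:/queue/注文\ncontent-type:text/plain\n\n".encode("utf-8")
    print(parse_headers(frame))   # the Japanese queue name passes through as-is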
I wish I could live in the world where I could bluntly say "I will assume UTF-8 and ignore the rest of the world". Many Japanese documents and sites still use Shift JIS. Windows has this strange Windows-932 format that you will frequently encounter in CUE files outputted by some CD ripping software. ARIB STD-B24, the captioning standard used in Japanese television, has its own text encoding with characters not found in either JIS X 0201 or JIS X 0208. These special characters are mostly icons used in traffic and weather reports, but transcoding to UTF-8 still causes trouble with these icons.
It's more like "I will assume UTF-8 and ignore edge case encoding problems which still arise in Japan, for some strange reason".
We are not running short on Unicode codepoints. I'm sure they can spare a few more to cover the Japanese characters and icons which invariably get mentioned any time this subject comes up on HN. I don't know why it hasn't happened and I won't be making it my problem to solve. Best I can do is update to version 16 when it's released.
I mention Japanese because I deal with Japanese text daily. I could mention some Chinese documents and sites using GBK to save space (since such encodings use exactly 2 bytes per character whereas the average size in UTF-8 is strictly larger than 2 bytes). But I am not very familiar with it. Overall, I would not say these are "strange reasons".
Other encodings exist, yes. But they can all be mapped to UTF-8 without loss of information[0]. If someone wants to save space, they should use compression, which will reduce any information, regardless of encoding, to approximately the same size. So it's perfectly reasonable to write software on the assumption that data encoded in some other fashion must first be reëncoded as UTF-8.
[0]: Except Japanese, people hasten to inform us every time this comes up. Why? Why haven't your odd characters and icons been added to Unicode, when we have cuneiform? That's the strange part. I don't understand why it's the case.
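As a rough check of the compression point (my own toy measurement; exact numbers depend on the text and compressor, and the repetition here makes the sample compress especially well):

    import zlib

    text = "这是一段用来比较编码大小的中文文本。" * 200
    utf8, gbk = text.encode("utf-8"), text.encode("gbk")

    print(len(utf8), len(gbk))   # raw: roughly 3 bytes/char vs 2 bytes/char
    print(len(zlib.compress(utf8, 9)), len(zlib.compress(gbk, 9)))
    # compressed: the gap shrinks dramatically, though it may not vanish entirely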
Unicode did a kind of dumb thing with CJK: unifying Chinese and Japanese kanji makes displaying CJK text a much harder problem than it should be, since correct display now also relies on a language-specific font[0]. I guess this could be band-aided by some sort of language marker in the UTF-8 bytestring, which a text shaping engine would then have to understand and switch the font accordingly.
Kind of a band-aid (it's necessary to stuff a variant selector after a CJK codepoint), but should work.
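A sketch of that band-aid, as far as I understand it: append an Ideographic Variation Selector to pin the glyph. 辻 (U+8FBB) has variation sequences with VS17/VS18 registered in the Adobe-Japan1 IVD collection (the one-dot vs. two-dot forms); whether anything actually changes on screen depends entirely on the font and the text renderer.

    base = "\u8fbb"                    # 辻
    variant_a = base + "\U000e0100"    # U+8FBB + VS17
    variant_b = base + "\U000e0101"    # U+8FBB + VS18
    print(variant_a, variant_b)        # identical to the eye unless the font
                                       # and shaper honor the selectors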
These decisions were made back in 1992, and fitting every codepoint in 16 bits was one of the desired goals; non-unified CJK wouldn't have fit. In hindsight it looks like a rather unfortunate decision, but having more codepoints than would fit in 16 bits could have seriously hampered adoption, and a different standard might have won (compute resources were far more limiting back then).
In either case, it's like 4-byte addressing in IPv4: in hindsight 6+ bytes would have been better, but what's done is done.
Edit: Even in 2000s, when C# was released, string was just a sequence of 16-bit code units (not codepoints), so they could deal with BMP without problems and astral planes were ... mostly DIY. They added Rune support (32-bit codepoint) only in .NET Core 3.0 (2019).
They aren't strange, but they are sort of self-inflicted, so it's not unreasonable for others to say, "we're not going to spend time and effort to deal with this mess".
I'm Russian. 20 years ago that meant having to deal with two other common encodings aside from UTF-8 (CP1251 and KOI8-R). 25 years ago, it was three encodings (CP866 was the third one). Tricks like what the article describes were very common. Things broke all the time anyway because heuristics aren't reliable.
These days, everything is in UTF-8, and we're vastly better off for it.
Unless the Unicode Consortium decides to undo the Han Unification stuff, I don't think it's going to get better for Japanese users, and programmers who build for a Japanese audience will have to continue to suffer with Shift-JIS.
There will be no undoing of anything, fortunately. Unicode is committed to complete backward compatibility, to the point where typos in a character name are supplemented with an alias, rather than corrected. Han Unification was an unforced error based on the proposition, which was never workable, that sixteen bits could work for everyone. This is entirely Microsoft's fault, by the way. But it shouldn't be, and won't be, fixed by breaking compatibility. That way lies madness.
There are two additional planes set aside for further Hanzi, the Supplementary and Tertiary Ideographic Planes; the latter is still mostly empty. Eventually the last unique ideograph used only to spell ten known surnames from the 16th century will also be added as a codepoint.
I view the continued use of Shift-JIS in Japan as part of a cultural trend, related to the continued and widespread use of fax machines, or the survival of floppy disks for many years after they were effectively dead everywhere else. That isn't at all intended as an insult, it's that matters Japanese stay within Japan to a high degree. Japanese technology has less outside pressure for cross-compatibility.
Shift-JIS covers all the corner cases of the language, and Unicode has been slow to do likewise, and it isn't like Japanese computers don't understand UTF-8, so people have been slow to switch. It's the premise of "unaware of how it works in the rest of the world" that I object to. It's really just Japan. Everywhere else, including the Chinese speaking parts of the world, there's Unicode data and legacy-encoded data, and the solution to the later is to encode it in the former.
> ARIB STD-B24, the captioning standard used in Japanese television, has its own text encoding with characters not found in either JIS X 0201 or JIS X 0208.
Amazingly enough, ARIB STD-B24 is one of the major sources of Unicode emoji. So transcoding would actually work for them! (I am aware of some exceptions, but semantically there is no loss.) Unicode and UTF-8 are truly eating every other legacy encoding, so much so that it is becoming more reasonable to have a separate transcoding step.
But remember, adding emoji to a character encoding standard is Morally Bad, somehow, and Proof Of Intellectual Decay In The Modern World. Also, Unicode invented emoji to Sap And Impurify Our Bodily Fluids and there is no reason for any of them to exist.
Most people on this site probably live in the world where everything is done in English. That's the norm for the vast majority of businesses and people in the US.
Even for those people, there is still a ton of old text files in Windows-1252 etc floating around.
You can choose to never work on projects where you have to deal with files like that.
But there may come a day where you have to choose between not paying your rent or writing a tool that converts old textfiles to UTF-8. At that point, it's nice to have references on the internet on how other people have actually dealt with it and what works. "Abort with an error" is not very useful advice then.
Why would you write a tool that does that instead of just digging up one that's already written? This sounds like the folly of writing ones own encryption library.
As far as implementing new tech goes, this sounds like about the easiest research project you could ask for. Grab a few and test em out. Not sure why you're making this out to be some kind of intractable problem.
I do. I cut my IT teeth on an old ass system from the 80s in the 00's. I remember having problems feeding the reports into modern systems. Goofy problems with eol and eof and some other hiccups. It wasn't that bad.
Honestly if you can't review/read/figure that out without writing a library of your own, you probably shouldn't be writing a library of your own in the first place.
You can't pick and use a library like this without understanding the underlying concepts. That goes for both encryption and the charset conversion issue. It's not always just plug and play.
There are examples of where people used encryption libraries in the wrong way and undermined the strength of the encryption (for example, CVE-2024-31497 in PuTTY).
A very big part of software development is dealing with leaky abstractions. We don't work with perfect black boxes. We need to understand enough of how things works in the lower layers to avoid problems. Note here that I wrote "enough", not "everything", or "write everything yourself".
I would not want a person writing software to handle charset conversion if he refuses to learn how the various encodings work, which charset will decode as another charset or not, etc.
Your example seems to be due to, paraphrasing, lack of a library rather than inability to choose one.
"older approach, PuTTY developers said, was devised at a time when Microsoft Windows lacked native support for a cryptographic random number generator."
So "enough of how things work" could just be "pick a modern encryption library" that doesn't come from the dark ages when there were no random numbers.
Same with encodings: picking a library requires a much lower level of understanding, since you can rely on the expertise of others.
Fascinating topic. There are two ways the user/client/browser receives reports about the character encoding of content. And there are hefty caveats about how reliable those reports are.
(1) First, the Web server usually reports a character encoding, a.k.a. charset, in the HTTP headers that come with the content. Of course, the HTTP headers are not part of the HTML document but are rather part of the overhead of what the Web server sends to the user/client/browser. (The HTTP headers and the `head` element of an HTML document are entirely different.) One of these HTTP headers is called Content-Type, and conventionally this header often reports a character encoding, e.g., "Content-Type: text/html; charset=UTF-8". So this is one place a character encoding is reported.
If the actual content is not an (X)HTML file, the HTTP header might be the only report the user/client/browser receives about the character encoding. Consider accessing a plain text file via HTTP. The text file isn't likely to itself contain information about what character encoding it uses. The HTTP header of "Content-Type: text/plain; charset=UTF-8" might be the only character encoding information that is reported.
(2) Now, if the content is an (X)HTML page, a charset encoding is often also reported in the content itself, generally in the HTML document's head section in a meta tag such as '<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>' or '<meta charset="utf-8">'. Now just because an HTML document self-reports that it uses a UTF-8 (or whatever) character encoding, that's hardly a guarantee that the document does in fact use said character encoding.
Consider the case of a program that generates web pages using a boilerplate template still using an ancient default of ISO-8859-1 in the meta charset tag of its head element, even though the body content that goes into the template is being pulled from a database that spits out a default of utf-8. Boom. Mismatch. Janky code is spitting out mismatched and inaccurate character encoding information every day.
Or consider web servers. Consider a web server whose config file contains the typo "uft-8" because somebody fat-fingered it while updating the config (I've seen this in random web pages). Or consider a web server that uses a global default of "utf-8" in its outgoing HTTP headers even when the content being served is a hodge-podge of UTF-8, WINDOWS-1251, WINDOWS-1252, and ISO-8859-1. This too happens all the time.
I think the most important takeaway is that with both HTTP headers and meta tags, there's no intrinsic link between the character encoding being reported and the actual character encoding of the content. What a Web server tells me and what's in the meta tag in the markup just count as two reports. They might be accurate, they might not be. If it really matters to me what the character encoding is, there's nothing for it but to determine the character encoding myself.
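Treating both sources as mere reports and then checking them against reality can be as simple as this sketch (hypothetical URL; uses the common requests and BeautifulSoup libraries):

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/page.html")

    # Report 1: charset derived from the Content-Type HTTP header
    # (requests falls back to a default when the header is silent).
    header_charset = resp.encoding

    # Report 2: charset claimed by the document's own <meta charset="..."> tag
    # (the older http-equiv form would need a second lookup, omitted here).
    soup = BeautifulSoup(resp.content, "html.parser")
    meta = soup.find("meta", charset=True)
    meta_charset = meta["charset"] if meta else None

    # Reality check: does the raw content actually decode as claimed?
    # (Single-byte codecs will "decode" anything, so this mostly catches
    # UTF-8 lies and typo'd charset names.)
    for charset in {c for c in (header_charset, meta_charset) if c}:
        try:
            resp.content.decode(charset)
            print(charset, "decodes cleanly")
        except (UnicodeDecodeError, LookupError):
            print(charset, "does NOT decode")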
I have a Hacker News reader, https://www.thnr.net, and my program downloads the URL for every HN story with an outgoing link. I have seen binary files sent with a "UTF-8" Content-Type header. I have seen UTF-8 files sent with an "inode/x-empty" Content-Type header. My logs have literally hundreds of goofy inaccurate reports of content types and character encodings. Because I'm fastidious and I want to know what a file actually is, I have a function `get_textual_mimetype` that analyzes the content of what the URL's web server sends me. My program downloads the content and uses tools such as `iconv` and `isutf8` to get some information about what encoding it might be. It uses `xmlwf` to check if it's well-formed XML. It uses `jq` to check whether it's valid JSON. It uses `libmagic`. There's a lot of fun stuff the program does to pin down with a high degree of certainty what the content is. I want my program to know whether the content is an application/pdf, an image/webp, a text/html, an application/xhtml+xml, a text/x-csrc, or whatever. Only a rigorous analysis will tell you the truth. (If anyone is curious, the source for `get_textual_mimetype` is in the repo for my HN reader project: https://github.com/timoteostewart/timbos-hn-reader/blob/main... )