How about assume utf-8, and if someone has some binary file they'd rather a program interpret as some other format, they turn it into utf-8 using a standalone program first. Instead of burning this guess-what-bytes-they-might-like nonsense into all the software.
We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.
I doubt you can handle UTF-8 properly with that attitude.
The problem is, there is one very popular OS on which it's very hard to enforce UTF-8 everywhere: Microsoft Windows.
It's very hard to ensure that the whole software stack you depend on uses the Unicode version of the Win32 API. Actually, the native character encoding in Windows is UTF-16, so you can't just assume UTF-8. If you're writing low-level code, you have to convert UTF-8 to UTF-16 and back. Even if you don't, you have to ensure all the low-level code you depend on does the same for you.
Oh, and don't forget about Unicode normalization. There is no THE UTF-8; there are a bunch of UTF-8s with different normalization forms. Apple's macOS uses NFD while others mostly use NFC.
These are just some examples. When people living in the ASCII world casually say "I just assume UTF-8", in reality they're still assuming ASCII.
> Actually, the native character encoding in Windows is UTF-16, so you can't just assume UTF-8. If you're writing low-level code, you have to convert UTF-8 to UTF-16 and back.
Yes. You should convert your strings. Thankfully, UTF-16 is very difficult to confuse with UTF-8 because they're completely incompatible encodings. Conversion is (or should be) a relatively simple process in basically any modern language or environment. And personally, I've never run into a problem where the difference between NFC and NFD mattered. (Do you have an example?). The different forms are (or should be) visually completely identical for the user - at least on modern computers with decent unicode fonts.
The largest problem with UTF-8 (and its biggest strength) is how similar it is to ASCII. It is for this reason we should consider emoji to be a wonderful gift to software correctness everywhere. Correctly handling emoji requires that your software can handle unicode correctly - because they need multi-unit encoding with both UTF-16 and UTF-8. And emoji won't render correctly unless your software can also handle grapheme clusters.
> When people living in the ASCII world casually say "I just assume UTF-8", in reality they're still assuming ASCII.
Check! If your application deals with text, throw your favorite multi-codepoint emoji into your unit testing data. (Mine is the polar bear). Users love emoji, and your software should handle it correctly. There's no excuse! Even the windows filesystem passes this test today.
My native language uses some additional CJK chars on Plane 2, and before the ~2010s a lot of software had glitches with anything beyond the basic plane of Unicode. I am forever grateful to the "Gen Z" who pushed for emoji.
Javascript's String.length is still semantically broken though. Too bad it's part of an unchangeable spec...
There's no definition of String.length that would be the obvious right choice. It depends on the use case. So probably better to let the application provide its own implementation.
> So probably better to let the application provide its own implementation.
I’d be very happy with the standard library providing multiple “length” functions for strings. Generally I want three:
- Length in bytes of the UTF-8 encoded form. E.g. useful for HTTP's Content-Length field.
- Number of Unicode codepoints in the text. This is useful for cursor positions, CRDT work, and some other stuff.
- Number of grapheme clusters in the text when displayed.
These should all be reasonably easy to query. But they’re all different functions. They just so happen to return the same result on (most) ascii text. (I’m not sure how many grapheme clusters \0 or a bell is).
Javascript’s string.length is particularly useless because it isn’t even any of the above methods. It returns the number of bytes needed to encode the string as UTF16, divided by 2. I’ve never wanted to know that. It’s a totally useless measure. Deceptively useless, because it’s right there and it works fine so long as your strings only ever contain ascii. Last I checked, C# and Java strings have the same bug.
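To make that concrete, here's roughly what those three measures (plus .length) look like in today's JavaScript/TypeScript, using the polar bear mentioned upthread. Treat it as a sketch; the APIs involved (TextEncoder, Intl.Segmenter) are described in the replies below, and the counts in the comments are what current engines should report:

```ts
// Polar bear is U+1F43B (bear) + U+200D (ZWJ) + U+2744 (snowflake) + U+FE0F (VS16).
const polarBear = "\u{1F43B}\u200D\u2744\uFE0F";

// 1. UTF-8 byte length (what you'd put in Content-Length).
console.log(new TextEncoder().encode(polarBear).length); // 13

// 2. Unicode code points.
console.log([...polarBear].length); // 4

// 3. Grapheme clusters as displayed.
const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
console.log([...seg.segment(polarBear)].length); // 1

// ...and the built-in .length, which counts UTF-16 code units.
console.log(polarBear.length); // 5
```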
The built-in string.length method is useless (it returns the number of char objects) and I agree that's a problem, but the solution is also built into the language, unlike in JS.
JS these days also has ways to iterate over codepoints and grapheme clusters. If you treat the string as an iterator, then its elements will be single-codepoint strings, on which you can call .codePointAt(0) to get the values. (The major JS engines can allegedly elide the allocations for this.) The codepoint count can be obtained most simply with [...string].length, or more efficiently by looping over the iterator manually.
The Intl.Segmenter API [0] can similarly yield iterable objects with all the grapheme clusters of a string. Also, the TextEncoder [1] and TextDecoder [2] APIs can be used to convert strings to and from UTF-8 byte arrays.
JavaScript’s recently added implementation of String[Symbol.iterator] iterates through Unicode characters. So for example, [...str] will split any string into a list of Unicode scalar values.
Yep. I don't use eslint, but if I did I would want a lint against any use of string.length. It's almost never what you want, especially now that JavaScript supports Unicode through [...str].
String.length is fine, since it counts UTF16 (UCS2?) code units. The attribute was only accidentally useful for telling how many characters were in a string for a long time, so people think it should work that way.
> I've never run into a problem where the difference between NFC and NFD mattered. (Do you have an example?)
The main place I've seen it get annoying is searching for some text in some other text, unless you normalize the data you're searching through the same way you normalize your search string.
After reading the above comment I went looking for Unicode documentation about the different normalisation forms. One point surprised me because I hadn't ever thought of it: search should be insensitive to normalisation form, so generally you should normalise all text before running a search.
That’s a great tip - obvious in hindsight but one I’d never considered.
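A minimal sketch of why that matters, assuming a JavaScript/TypeScript environment (String.prototype.normalize is standard):

```ts
const composed = "caf\u00E9";     // "café" with a precomposed é (NFC)
const decomposed = "cafe\u0301";  // "café" as plain e + combining acute accent (NFD)
const haystack = "menu: " + composed + " au lait";

console.log(composed === decomposed);       // false, even though they render identically
console.log(haystack.includes(decomposed)); // false: the NFD search string misses the NFC text

// Normalize both sides to the same form and the mismatch disappears.
console.log(haystack.normalize("NFC").includes(decomposed.normalize("NFC"))); // true
```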
> your software should handle it correctly. There's no excuse!
It is valid for the presentation of compound emoji to fall back to their component parts. You can't expect every platform to have an up-to-date database of every novel combination. A better test is emoji with color modifiers. Another good one is grandfathered symbols with both a text and emoji presentation, forcing the chosen glyph with the variant selector prefix.
> You can't expect every platform to have an up to date database of every novel combination.
On modern desktop OSes and smart phones, I do expect my platform to have an up-to-date unicode database & font set. Certainly for something like the unicode polar bear, which was added in 2020. I'll begrudgingly look the other way for terminals, embedded systems and maybe video games... but generally it should just work everywhere.
Server code generally shouldn't interact with unicode grapheme clusters at all. I'm struggling to think of any common, valid reason to use a unicode character database in 'normal' backend server code.
> Another good one is grandfathered symbols with both a text and emoji presentation and forcing the chosen glyph with the variant selector prefix.
I didn't know about that one. I'll have to try it out.
I think being continuously updated should be tied to receiving new external content.
It's fine to have an embedded device that's never updated, but never receives new content - it doesn't matter that a system won't be able to show a new emoji because it doesn't have any content that uses that new emoji.
However, if it is expected to display new and updated content from the internet, then the system itself has to be able to get updated and actually get updated; there's no acceptable excuse for that - if it's going to pull new content, it must also pull new updates for itself.
As the user/owner of the device, no thanks. It should have code updates if and only if I ask it to, which I probably won't unless you have some compelling reason. For the device owner, pulling new updates by itself is just a built in backdoor/RCE exploit, and in practice those backdoors are often used maliciously. I'd much rather my devices have no way to update and boot from ROM.
The fact that we have to go as far back as CD players for a decent example illustrates my point - the "CD player" content distribution model is long dead, effectively nobody sells CD players or devices like CD players, effectively nobody distributes digital content on CDs or things like CDs (like, CD sales are even below vinyl sales) - almost every currently existing product receives content through a channel where updates could trivially be sent as well.
And if we're talking about how new products should be designed, then the "almost" goes away and they 100% wouldn't receive new content through anything like CDs, the issue transforms from an irrelevant niche (like CDs nowadays) to a nonexistent one.
> Another good one is grandfathered symbols with both a text and emoji presentation and forcing the chosen glyph with the variant selector prefix.
I despise that Unicode retroactively applied default emoji presentation to existing symbols, breaking old text. Who the hell thought that was a good idea?
Possibly because the major software companies made it work on phones. Thus users saw it working in many apps and complain when your app fails to do the same.
Google, Apple, IBM and MS also did a lot of localisation so their code bases deal with encoding.
It is FOSS Unix software that had the ASCII mindset, probably because C and C++ string types are byte arrays and many programmers want to treat strings as arrays. The macOS and Windows APIs do take UTF encodings as their input, not char * (agreed, earlier versions did not, but they have provided the UTF encodings for 25 years at least).
Those are ok, but both of those emoji are represented as a single unicode codepoint. Some bugs (particularly UI bugs) only show up when multiple unicode characters combine to form a single grapheme cluster. I'd recommend something fancier.
I just tried it in gnome-terminal, and while the crying emoji works fine, polar bear or a country flag causes weird issues.
Older versions of macOS did enforce NFD for file names, but more recent versions don't, at least at the OS level. But many Apple programs, such as Finder, _will_ use NFD. Except that it isn't even Unicode-standardized NFD; it is Apple's own modified version of it. And this can cause issues when, for example, you create a file in Finder, then search for it using `find` and type the name of the file the exact same way, but it can't find the file because find got an NFC form while the actual file name is in NFD.
OTOH, in many applications, you don't really care about the normalization form used. For example, if you are parsing a CSV, you probably don't need to worry about whether one of the cells uses a single code point or two code points to represent that accented e.
Thanks, yet another quantum of knowledge that makes one's life irreversibly ever so slightly worse. But not as bad as encryption (and learning all the terrible ways most applications have broken implementations in)
We make some B2B software running on Windows, integrating with customer systems. We get a lot of interesting files.
About a decade ago I wrote some utility code for reading files, where it'll try to detect a BOM first, and if there isn't one, scan for invalid UTF-8 sequences. If none are found, assume UTF-8; otherwise assume Windows-1252. Worked well for us so far.
Still get the occasional flat file in Windows-1252 with one random field containing UTF-8, so some special handling is needed for those cases. But that's rare.
Fortunately we don't have to worry about normalization for the most part. If we're parsing then any delimiters will be one of the usual suspects and the rest data.
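For the curious, here's a minimal sketch of that detection order (BOM, then strict UTF-8 decode, then fall back to Windows-1252), assuming a runtime whose TextDecoder knows the windows-1252 and utf-16 labels. Not the original utility, just the same idea:

```ts
function decodeTextFile(bytes: Uint8Array): string {
  // 1. BOM check.
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf)
    return new TextDecoder("utf-8").decode(bytes.subarray(3));
  if (bytes[0] === 0xff && bytes[1] === 0xfe)
    return new TextDecoder("utf-16le").decode(bytes.subarray(2));
  if (bytes[0] === 0xfe && bytes[1] === 0xff)
    return new TextDecoder("utf-16be").decode(bytes.subarray(2));

  // 2. No BOM: strict UTF-8 decode; fatal:true throws on any invalid sequence.
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    // 3. Not valid UTF-8: assume Windows-1252 (every byte sequence is valid there).
    return new TextDecoder("windows-1252").decode(bytes);
  }
}
```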
Microsoft Windows is a source of many a headache for me as almost every other client I write code for has to deal with data created by humans using MS Office. Ordinary users could be excused, because they are not devs but even devs don't see a difference between ASCII and UTF-8 and continue to write code today as if it was 1986 and nobody needed to support accented characters.
I got a ticket about some "folders with Chinese characters" showing up on an SMB share at work. My first thought was a Unicode issue, and sure enough, when you combine two ASCII letter bytes into one UTF-16 code unit, it will usually wind up in the CJK Unified Ideographs range of Unicode. Some crappy software had evidently bypassed the appropriate Windows APIs and just directly wrote a C-style ASCII string onto the filesystem without realizing that NTFS is UTF-16.
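To illustrate the mechanics (a sketch, not the actual software involved): pairs of ASCII letter bytes, reinterpreted as UTF-16LE code units, mostly land in the CJK ideograph blocks.

```ts
// "Report" as ASCII bytes: 52 65 70 6F 72 74. Read two bytes at a time as
// UTF-16LE code units and you get 0x6552 0x6F70 0x7472, all CJK ideographs.
const asciiBytes = new TextEncoder().encode("Report"); // UTF-8 of pure ASCII == the raw ASCII bytes
const misread = new TextDecoder("utf-16le").decode(asciiBytes);
console.log(misread); // three CJK characters where "Report" was expected
```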
Do you know of a resource that explains character encoding in greater detail? Just for my own curiosity. I am learning web development and boy, they brow beat UTF-8 upon us which okay, I'll make sure that call is in my meta data, but none bother to explain how or why we got to that point, or why it seems so splintered.
This Joel On Software article [0] is a good starting point. Incredibly it's now over 20 years old so that makes me feel ancient! But still relevant today.
The suggestion that the web should just use utf-8 everywhere is largely true today. But we still have to interact with other software that may not use utf-8 for various legacy reasons - the CSV file example in the original article is a good example. Joel's article also mentions the solution discussed in the original article, i.e. use heuristics to deduce the encoding.
Why would it break? If you just assume that the system codepage is UTF-8, then sure. If you specifically say in your manifest that you want UTF-8, then Windows (10+) will give you UTF-8 regardless of which locale it is:
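For reference, this is roughly the application-manifest setting being referred to (check Microsoft's documentation for the exact boilerplate); with it, the process's ANSI code page becomes UTF-8 on Windows 10 version 1903 and later:

```xml
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
```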
Some [1] may also consider working for any company/app that needs to display an emoji, to be a waste of at least one life (your life, and all your users' lives).
Non-technical users don't want to do that, and won't understand any of that. That's the unfortunate reality of developing software for people.
If Excel generates CSV files with some Windows-1234 encoding, then my "import data from CSV" function needs to handle that, in one way or another. A significant number of people generate CSV files from Excel and if these people upload these files into my application and I assume an incorrect encoding, they won't care that Microsoft is using obsolete or weird defaults. They will see it as a bug in my program and demand that I fix my software. Even if Excel offers them a choice in encoding, they won't understand any of that and more importantly they don't want to deal with that right now, they just want the thing to work.
> then my "import data from CSV" function needs to handle that, in one way or another.
It doesn't. Well, maybe "another".
Your function or even app doesn't need to handle it. Here's what we did on a bookkeeping app: remove all the heuristics, edge cases, broken-CSV handling and validation from all the CSV ingress points.
Add one service that did one thing: receive a CSV, and normalize it. All ingress points could now assume a clean, valid, UTF8 input.
It removed thousands of LOCs, hundreds of issues from the backlog. It made debugging easy and it greatly helped the clients even.
At some point, when we offered their import runs for download, we added our normalized versions as [original name]_clean.csv. Got praise for that. Clients loved it, as they were often well aware of their internal mess of broken CSVs.
The point is to separate the cleaning stage from the import stage. Having a clean UTF-8 CSV makes debugging import issues so much easier. And there are several well-working CSV tools, such as the Python ones, that can detect not only character encodings but also record separators and the various quotation idiosyncrasies you also need to be aware of when dealing with Microsoft Office files. Other people have already thought long and hard about that stuff so you don't have to.
Could this work? Implement handling of ancient Excel files in your SaaS product, but charge an extra dollar for parsing legacy formats and provide information on how to export correct files from Excel next time :)
Plus, Excel really likes to use semicolons instead of commas for comma-separated-files. That's another idiosyncrasy that programmers need to take into account.
I'm German and I hate this. The icing on top: Excel separates function arguments with a semicolon in German locale, too. Got me some head scratching. Examples are separated with comma, iirc.
>A significant number of people generate CSV files from Excel and if these people upload these files into my application and I assume an incorrect encoding...
I think most people would rather get in their car, drive to the ocean, board a ship and join the navy, and come back after a few years abroad - than following some instructions on how to use the computer.
Joining the navy and then be obliged to follow instructions on so many things you'd normally never even be ordered to do (because now you're in the Navy, sailor! and there are three ways to do any thing: the right way, the wrong way, and the Navy way, and guess which way we use here in the Navy?), just to not follow some instructions on how to use the computer... is something I can actually imagine some people doing.
Pretty funny, and probably some truth to that being a user sentiment.
OTOH, they have to follow some process to use the software. For just the CSV export, they already have to ensure column orders, values, formats, maybe headers, etc. Selecting UTF-8 from a dropdown seems like the easy part.
They’ll probably switch to the application that does it for them and just works, instead of the one telling them to do something they don’t really understand.
Selecting UTF-8 from a dropdown on the export doesn't seem too onerous an ask. If that's the differentiator between you and your competitors, then you might have bigger problems.
I have sat in on and observed many in-person usability sessions with various applications and websites.
I can tell you right now anyone reading this site is in a completely different universe of tech competency than the general public, and even professionals who aren’t tech-focused.
You would have lost many of them simply with the jumble of letters “UTF-8”.
I hope someday, with improved public education, we can have better users. Unlikely with Gen Z apparently having worse computer skills than the previous generation..
Yeah, I started writing software a "while" ago. I've encountered a handful of users (and charsets) in my time.
With CSV exports, there's already a good bit of training you have to do WRT the column layout, file format, headers (or not), cell value formats, etc. There's far more lift involved in training users there than ensuring they select "UTF-8" when they select CSV.
And, there's really nothing technical about it, as they don't need to understand what "UTF-8" actually means any more than they need to understand what "CSV" stands for. It's a simple ask in a list of asks.
I weigh this against having every developer now believe they have to check/convert every character set, which can be unreliable and produce garbage, BTW. And, speaking of garbage, there are some encodings that you won't be able to convert in any case, so it would be technically impossible to preserve data integrity without pushing some requirement on the source up to the user.
So, it's about tradeoffs. And, again, if asking users to choose UTF-8 is the difference between customers choosing your app or a competitor's, then you probably need to be more worried about that than your charset encoding.
Hell we just accept xlsx and xls natively. It sucks but it solves many issues. Also causes some, but you'll never get it out of users' heads that Excel files are not a data exchange format.
Definitely helps. Should be able to detect the charset from the Excel metadata, at least.
Of course, if that charset uses code points not available in your app's native charset, then you're kind of back to square one (unless your use case tolerates garbled or missing data).
> Isn't the point of the article to describe how an engineer would create such a tool?
Honestly, no, because the tool that it's suggesting how to write isn't one that will even come close to doing a good job.
If you want to write such a tool, the first thing you need to do is to understand what the correct answer is. And to do that, you need to sample your input set to figure out what the correct answer should be for several inputs where it matters. There's unfortunately no easy way to avoid that work; universal charset detection isn't really a thing that works all that well.
>universal charset detection isn't really a thing that works all that well.
This seems like something LLMs would be good at. A mundane use of them, but I bet they'd be really good at determining that the input has the wrong encoding. Then the program would iterate through encodings, from most probable to least, and select the one that the LLM likes the most. Granted, this means your tool will be 1GB or more. But hey, thems the breaks.
Yeah, that could be an interesting use of LLMs. It could at least tell you which languages might be present in the input text.
In the 1980s, we had a version of 7-bit ASCII in Sweden where the three extra Swedish vowels "åäö" were represented by "{}|".
So what might look like regular US 7-bit ASCII should be interpreted as the Swedish version if the text is in Swedish with "{}|" where "åäö" normally goes.
I'm glad that didn't stick around, like ¥/₩ being used for directory separators like \. I can't imagine trying to read source code with those substitutions.
> should be asking the user to select explicit input and output formats
It depends on the requirements.
If you're hired by a company to convert millions of old textfiles, they might want you to do it as well as possible using heuristics without any human input as a starting point.
Only with "pure" CJK text in a flat text file; for most real-world situations you'll have enough ASCII text that UTF-8 will be smaller: HMTL/XML tags, email headers, things like that. I did some tests a few years back, and wasn't really able to come up with a real-world situation where UTF-16 is smaller. I'm sure some situations exist, but by and large, CJK users are better off with UTF-8.
Yep. I'm a heavy user of CJK languages and I don't give a damn about the slightly increased plaintext storage. Give me UTF-8 any day, every day. Legacy two-byte encodings can't represent all of the historical glyphs anyway, so there's no room for nationalist crap here.
Well, it's great that you did some tests a few years back, but I'm not sure how that qualifies you to make such a sweeping generalization about CJK text encoding. It's easy to dismiss UTF-16's benefits when you're only looking at a narrow slice of the real world, ignoring the vast amounts of pure CJK literature, historical archives, and user-generated content out there.
That is not the only way. There are other ways of knowing partial contents of files and changes to files, depending on the situation. If the document is a known form in which one of five boxes is checked by the sender, it's probably not hard to rule out certain selections based on the ciphertext length, if not pin down the contents exactly.
I'm not sure i entirely understand your example (if there are 5 checkboxes and 1 checked, presumably length would be the same regardless which one of those are checked). However to your broader point, i agree there exist scenarios along those lines (e.g. fingerprinting known communication based on length), however most of them apply even better when not using compression.
The checkbox example is completely plausible. There is no guarantee that all checkboxes lead to the same number of bytes changed in the file when checked. What if the format makes a note of the page number wherever a checkbox is checked? 1X could be two bytes and 15X would be three.
And even if the format only stored the checkbox states as a single bit each (unlikely), compression algorithms don't care. They will behave differently on different byte sequences, which can easily lead to a difference in output length.
The attack you're referring to is not specific to compression. It's the same class of attack that can reveal keystrokes over older versions of ssh based on packet size and timing, even on uncompressed connections. Conversely, fixed-bitrate voice streams don't have the same vulnerability as variable-bitrate encodings even though they're still compressed.
The version of your checkbox example which is vulnerable without any formal data compression is when the checkbox is encoded in a field that is only included or changes in length if the value isn't the default, common in uncompressed variable-length encodings like JSON.
I'm sure that the people getting hacked care deeply about whether the attack they suffered was sui generis.
Also, zip/deflate etc was not designed to eliminate side channel leakage. Some compression schemes obviously (with padding) can mitigate leaks, but it has to be done deliberately
Any of it has to be done deliberately. The length of the data reveals something about its contents whether it's compressed or not.
The special concern with compression is when attacker-controlled data is compressed against secret data because then the attacker can measure the length multiple times and deduce the secret based not just on the length but on how the length changes when the secret is constant and the attacker-controlled data varies. This can be mitigated with random padding (makes the attack take many times more iterations because it now requires statistical sampling) or prevented by compressing the sensitive data and attacker-controlled data separately.
It's an extension of the chosen-plaintext attack, and so requires the attacker to be able to send custom text that they know is in the encrypted payload. If the unencrypted payload is "our-secret-data :::: some user specified text", then the attacker can eventually determine the contents of our-secret-data by observing how the size of the encrypted response changes as they change the text when the compression step matches up with a part of the secret data. It can be defeated by adding random-length padding after compression and before the encryption step, though.
Essentially if you zip something, repeated text will be deduplicated.
For example "FooFoo" will be smaller than "FooBar" since there is a repeated pattern in the first one.
The attacker can look at the file size and make guesses about how repetitive the text is if they know what the uncompressed or normal size is.
This gets more powerful if the attacker can insert some of their own plaintext.
For example if the plaintext is "Foo" and the attacker inserts "Fo" (giving "FooFo") the result will be smaller than if they inserted zq where there is no pattern. By making lots of guesses the attacker can figure out the secret part of the text a little bit at a time just by observing the size of the ciphertext after inserting different guesses.
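Here's a minimal sketch of that guessing game using Node's zlib (deflateSync). The secret and the query parameter are made up, and the exact byte counts vary, but once the match is long enough a correct guess reliably compresses noticeably smaller than an incorrect one:

```ts
import { deflateSync } from "node:zlib";

// Hypothetical secret that ends up in the same compressed stream as attacker input.
const secret = "sessionid=4f3c2a1b9d8e7f6a";

// All the attacker can observe is the length of the compressed (then encrypted) output.
function observedLength(attackerInput: string): number {
  return deflateSync(secret + "&q=" + attackerInput).length;
}

// A guess that repeats part of the secret gets deduplicated by DEFLATE, so it
// compresses smaller than a guess that doesn't; comparing lengths leaks the secret.
console.log(observedLength("sessionid=4f3c2a1b9d8e7f6a")); // correct guess: smaller
console.log(observedLength("sessionid=x9y8z7w6v5u4t3s2")); // wrong guess: larger
```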
Encrypting without zipping doesn't leak any information about the content. You can't rule out certain byte sequences (other than by total length) just by looking at the ciphertext length.
If "oui" compresses to two bytes and "non" compresses to one byte, and then you go over them with a stream cipher, which is which:
This has nothing to do with compression. If you use "yes" and "no" instead of "oui" and "non" (which just happen to be three characters each) and you compress "yes" to "T" and "no" to "F" then the uncompressed text will be the leaky one.
Yes, and my example was an example meant to prove the opposite idea. The point is that it is irrelevant whether you compress or not. You can leak information either way.
It's in the article if you would bother to read it LOL. "simply measuring the size of packets without decoding them can identify whole words and phrases with a high rate of accuracy . . . [the researchers] can search for chosen phrases within the encrypted data"
Cryptography noob here: I'm confused by "Encrypting without zipping doesn't leak any information about the content." Logically speaking, if we compress first and therefore "the content" will now refer to "the zipped content", doesn't this mean we still can't get any useful information?
Not OP, but 'zipping and encrypting' one thing (a file for example) does not leak information by itself. The problem comes when an adversary is able to see the length of your encrypted data, and then can see how that length changes over time - especially if the attacker can control part of the input fed to the compressor.
So if you compressed the string "Bob likes yams" and I could convince you to append a string to it and compress again, then I could see how much the compressed length changed.
If the string I gave you was something already in your data then the string would compress more than it would if the string I gave you was not already in your data - "Bob likes yams and potatoes" will be larger than "Bob likes yams likes Bob".
If the only thing I can see about your data is the length and how it changes under compression - and I can get you to compress that along with data that I hand to you - then eventually I can learn the secret parts of your data.
Encryption generally leaks the size of the plaintext.
This is true in both the compressed and non-compressed case. However with compression the size of the plaintext depends on the contents, so the leak of the size can matter more than when not using compression.
Even without compression this can matter sometimes. Imagine encrypting "yes" vs "no".
> Encryption generally leaks the size of the plaintext.
Ah, I see. Naïvely, this seems like a really bad thing for an encryption algorithm to do—is there no way around it? Like, why is encryption different from hashing in this regard?
There are methods, but they are generally very inefficient bandwidth wise in the general case. The general approach is to add extra text (pad) so that all messages are a fixed size (or e.g. some power of 2). The higher the fixed size is, the less information is leaked and the less efficient it is. E.g. if you pad to 64mb but need to transmit a 1mb message, that is 63mb of extra data to transmit.
Part of the problem (afaik) is we lack good math tools to analyze the trade offs of different padding size vs how much extra privacy they provide. This makes it hard to reason about how much padding is "enough".
Another approach is adding a random amount of padding. This can be defeated if you can force the victim to resend messages (which you then average out the size of).
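A tiny sketch of the fixed-bucket padding idea from a couple of paragraphs up, assuming the receiver has some separate, unambiguous way to strip the padding again (e.g. a length prefix, which is left out here):

```ts
// Pad every message up to the next power-of-two bucket so the ciphertext length
// reveals only a coarse size class, not the exact plaintext length.
function padToNextPowerOfTwo(message: Uint8Array): Uint8Array {
  let bucket = 1;
  while (bucket < message.length) bucket *= 2;
  const padded = new Uint8Array(bucket); // zero-filled padding bytes
  padded.set(message);
  return padded;
}
```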
Hashing is different because you don't have to reconstruct the message from the hash. With encryption the recipient needs to decrypt the message eventually and get the original back. However there is no way to transmit (a maximally compressed) message in less space then it takes up.
There are special cases where this doesn't apply, e.g. if you have a fixed transmission schedule where you send a specific number of bytes on an agreed-upon schedule.
Yes, of course it leaks more information than encryption without compression, because that’s just encryption which doesn’t leak anything.
In an enormous number of real-world cases, attacker-controlled input ends up being compressed alongside secret data. In that case you can guess at the secret data, and if you guess correctly, you get smaller compressed output. But even without that, imagine the worst case: a 1TB file that compresses to a handful of bytes. Pretty clearly the overwhelming majority of the text is just duplicate bytes. That’s information which is leaked.
> We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.
What ever happened to the Robustness Principle[1]? I think the entire comment section of this article has forgotten it. IMO the best software accepts many formats and "deals with it," or at least attempts to, rather than just exiting with "Hahah, Error 19923 Wrong Input Format. Try again, loser."
We collectively discovered that we were underestimating the long term costs, by a lot, so its lustre has faded. This is in some sense relatively recent, so word is still getting around, as the programming world does not move as quickly as it fancies itself to.
If you'd like to see why, read the HTML 5 parsing portion of the spec. Slowly and carefully. Try to understand what is going on and why. A skim will not reveal the issue. You will come to a much greater understanding of the problem.
Some study of what had happened when we tried to upgrade TCP (not the 4->6 transition, that's its own thing) and why the only two protocols that can practically exist on the Internet anymore are TCP and UDP may also be of interest.
The explosion of the web happened in no small part because of how easy it was to write some HTML and get a basic, working webpage out of it. If you nested some tags the wrong way and the browser just put up an error page, rather than doing a (usually) pretty good job figuring out what you actually meant, people would get frustrated faster and not bother with it at all.
But imagine if our C/C++/Java/Rust/Go/etc. compilers were like "syntax error, but ehhhhh you probably meant to put a closing brace there, so let's just pretend you did". That would be a nightmare of bugs and security issues.
The difficulty in drawing a line in the sand and sticking to the spec, though, is that of user blame. Let's say you implement a spec perfectly -- even if you are the originator of the spec -- and then someone comes along and builds something of their own that writes out files that don't conform to the spec. Your software throws up an error and says "invalid file", but the other piece of software can read it back in just fine. Users don't know or care about specifications; they just know that your software "doesn't work" for the files they have, and the other software does. If you try to tell them that the file is bad, and the other software has a bug, they really won't care.
> But imagine if our C/C++/Java/Rust/Go/etc. compilers were like "syntax error, but ehhhhh you probably meant to put a closing brace there, so let's just pretend you did". That would be a nightmare of bugs and security issues.
A possible solution to this is for large organizations to be intransigent about standards compliance. If your personal mail server rejects mail that isn't well-formed, you're just being a masochist because nobody is going to change for you. If Google does it, everybody else is going to fix their stuff because otherwise they can't send to gmail.
It proves that engineering is hard and that everything has costs and benefits, and you can't make them go away by just ignoring them. It turns out that "being robust" had a lot of costs we didn't see.
It also shows the harmfulness of binary black and white thinking in engineering. There are choices other than "just let everyone do whatever and hope all the different things picking up the pieces do it in more or less the same way" and "rigidly specify a spec and blow up the universe at the slightest deviation". Both of those easy-to-specify choices have excessive costs. There is no escape from hard design tasks. XHTML may always have been doomed to fail, but that is not to say that HTML had to be allowed to be as loosey-goosey as it is, either.
Had a gradient of failure been introduced rather than a rigid rock wall, things very likely wouldn't have gotten as badly out of hand as they did. If, for instance, a screwed up table was specified to deliberately render in a very aesthetically unappealing manner, but not crash the entire page the way XHTML did, people would have not come to depend so much on HTML being sloppy. The resulting broken page would still be somewhat usable, but there would have been motivation to fix it, rather than the world we actually live in where it all just seemed to work.
Can you give a hint as to what the issue is that one should find reading a portion of the HTML 5 spec? Or is it genuinely unexplainable without experiencing something first-hand?
>>> the best software accepts many formats and "deals with it,"
>> We collectively discovered that we were underestimating the long term costs, by a lot, so its lustre has faded.
>> If you'd like to see why, read the HTML 5 parsing portion of the spec.
> Can you give a hint as to what the issue is that one should find reading a portion of the HTML 5 spec
I think the point was that the HTML 5 spec tries to parse all kinds of weird input instead of drawing a line in the sand and forcing the input to follow a simple format?
It isn't bad. In fact it's quite good. But it is very much a case of closing the barn door after the animals got out. You see in the standard that the effort was put to corral them back in, and I'm very glad they did, but it certainly was not free.
Being lenient is all well and good when the consequences are mild. When the consequences of misinterpreting or interpreting differently to a second implementation becomes costly, such as a security exploit, then the Robustness Principle becomes less obviously a win.
It's important to understand that every implementation will try to fix-up formatting problems in their own way unique to their particular implementation. From that you get various desync or reinterpretation attacks (eg. HTTP request smuggling).
I've tried telling users "sorry, your file isn't to spec", and they say "but it works with <competitor>", and that ideology flies right out the window, along with their money.
Exactly. "Accepting and trying" is how a lot of popular software won their market. Look at HN's favorite media player, VLC. In the past, media player software were horrible, refusing to play all but the most tightly constrained set of allowed containers/codecs. I remember spending the early 2000s trying to get Windows Media Player to play MPEGs by downloading codec packs and trying to cast secret spells into the Windows Registry. Yuck! Then along comes VLC which accepts whatever you throw at it, and that software is basically everywhere now. You can throw line noise at VLC and it will try its best to play something!
The trick is to enforce conformance right from the start in the first implementation of a format. Shipping a product that doesn't interop with the existing tools is a non-starter so the devs will have to fix their shit first.
As you say, unfortunately the genie cannot be put back in the bottle for formats that already have defective implementations in the wild.
The problem is that unless you restrict everyone else, you'll end up with your own product not interoperating with *new* tools too.
E.g. you produce valid .wat files, but my software which also outputs those has some bits screwed up.
My program can read both .wat but yours can't, but I have 5% market share.
Your users complain they sometimes receive files your software can't read while the competitor can. Do you tell them "well that file is invalid, tell whoever sent it to you to change the software they use"?
The genie can't stay in the bottle unless you have some sort of certification authority and even that may not be enough (see USB)
But the more every actor aims to follow the spec rather than reproduce everyone else's bugs, the less quickly the ecosystem devolves into a horrid tangled mess that's nigh impossible for a new entrant.
"So you're saying that I can have an advantage over everyone else if I implement all the spec features, plus one extra feature that adds convenience to our users?
And you're saying that by doing this, not only do I gain an advantage over the existing competition, but I also make it more difficult for more competitors to appear?
The customer does not know. They just want it to work. They may be using something that someone else gave them. The original source system of the file may not be changeable. But most importantly, their boss just wants it to work. or else.
Yes, and that's why Postel's law is more of an empirical observation (a law of nature, if you will) on which software survives and which doesn't. You may dislike it but that won't make it go away.
I see it as the opposite, actually. They will pay you for a robust product: one that works for them. They have no care for the technical minutiae of your implementation, because they're just trying to do actually interesting things, with the help of your product. This is the customer perspective.
That is fine in contexts where a wrong guess does no harm.
But that is not always the case, and e.g. silently "fixing" text encoding issues can often corrupt the data if you get it wrong.
By all means offer options if you want, but if you do, flag very clearly to the user that they're taking a risk of corrupting the data unless any errors are very apparent and trivial to undo.
This basically loses data integrity if it's wrong though.
You might want to do that with human input if it's helpful to the user - ie user enters a phone number and you strip dashes etc. But if it's machine to machine, it should just follow the spec.
The article addresses this, that current thinking in many places is that the robustness principle / Postel's Law maybe wasn't the best idea.
If you reject malformed input, then the person who created it has to go back and fix it and try again. If you interpret malformed input the best you can (and get it right), then everyone else implementing the same thing in the future now also has to implement your heuristics and workarounds. The malformed input effectively becomes a part of the spec.
This is why HTML & CSS are the garbage dump they are today, and why different browsers still don't always display pages exactly alike. The reason HTML5 exists is because people finally just gave up and decided to standardize all the broken behavior that was floating around in the wild. Pre-HTML5, the web was an outright dumpster fire of browser compatibility issues (as opposed to the mere garbage dump we have today).
Anyway, it's not really important to try to convince you that Postel's Law is bad; what's important is that you know that many people are starting to think it's bad, and there's no longer any strong consensus that it was ever a good thing.
(A text file containing only the ASCII bytes "Bush hid the facts", when opened in older versions of Windows Notepad, displayed a sequence of CJK characters instead of the expected English sentence.)
I've lived through dealing with non-UTF8 encoding issues and it was a truly gigantic pain in the ass. I'm much more on the side now of people who only want to deal with UTF8 and fully support software that tells any other encoding to go pound sand. The harder life gets for people who use other encodings (yes, particularly Microsoft) the more incentive they have to eventually get on board and stop costing everyone time and effort managing all this nonsense.
> they turn it into utf-8 using a standalone program first
I took the article to be for people who would be writing that "standalone program"?
I have certainly been in a position where I was the person who had to deal with input text files with unknown encodings. There was no-one else to hand off the problem to.
Then you don't have to worry about it since you won't get the work in the future? Someone else, with this presumably correct software, will always be able to do it for less, faster, and at a higher quality.
That's how business works...
If such a business competitor doesn't exist, then yes charge extra, and actually do the work correctly.
Am I missing something here? The work is ingesting documents from uncontrolled sources that might not all be UTF-8 and handling them correctly. Using an encoding guessing tool is the means to do that. In practice since there are only a few widely-used encodings and they're not terribly ambiguous it means that everything just works and users happily use the software.
This isn't some theoretical thing; we do this at $dayjob right now, not only guessing the encoding but the file type as well, so that we can make sense out of whatever garbage our users upload. Everything from malformed CSV exports from Excel to PDFs that are just JPEGs of scanned documents. It works, and it works well.
And of course it does, the files our users are handing to us work on their machines. They can open them up and look at them in whatever local software they produced them with, there's no excuse for us to be unable to do the same.
The FCC ULS database records are stored in a combination of no fewer than three different encodings(1252, UTF8, and something else for a handful of German names) that vary per record.
When I brought this up they said something to the effect of: it's already Unicode, it has tilde letters!
I did once have a file that had UTF8, Windows-1252, MARC8, and VT100 (really) all mixed up in it. I think the data had gone through multiple migrations between software in its past.
I had to write my own "clean this as well as possible" thing, and it did a good enough job.
Not every encoding can make a round trip through Unicode without you writing ad hoc handling code for every single one. There's a number of reasons some of these are still in use and Unicode destroying information is one of them.
> We don't go "oh that input that's supposed to be json? It looks like a malformed csv file, let's silently have a go at fixing that up for you". Or at least we shouldn't, some software probably does.
Browsers used to have a menu option to choose the encoding you wanted to use to decode the page.
In Firefox, that's been replaced by the magic option "Repair Text Encoding". There is no justification for this.
They seem to be in the process of disabling that option too:
> Note: On most modern pages, the Repair Text Encoding menu item will be greyed-out because character encoding changes are not supported.
> Supporting the specific manually-selectable encodings caused significant complexity in the HTML parser when trying to support the feature securely (i.e. not allowing certain encodings to be overridden). With the current approach, the parser needs to know of one flag to force chardetng, which the parser has to be able to run in other situations anyway, to run.
> Elaborate UI surface for a niche feature risks the whole feature getting removed
> Telemetry [...] suggested that users aren’t that good at choosing correctly manually.
In other words, it's trying to protect users from themselves by dumbing down the browser. (Never mind that people who know what they are doing have probably also turned off telemetry...)
>> Supporting the specific manually-selectable encodings caused significant complexity in the HTML parser when trying to support the feature securely (i.e. not allowing certain encodings to be overridden).
There's no explanation of why you'd want this, or why it's security-relevant.
(Farther down, there's a mention of self-XSS, which definitely isn't relevant.)
>> Elaborate UI surface for a niche feature risks the whole feature getting removed
They've already removed the whole feature. That was easier to do after they mostly disabled it, not harder.
>> Telemetry showed users making a selection from the menu when the encoding of the page being overridden had come from a previous selection from the menu.
That would be an example of "working as expected". The removal of the ability to do this is the problem that disabling the encoding menu causes! Under the old, correct approach, you'd guess what the encoding was until you got it right. Under the new approach, the browser guesses for you, and if the first guess is wrong, screw you.
Probably because most websites now send a correct encoding header or meta tag, so the user changing it can only make things wrong. (That assumes the declared encoding is never wrong, but wrong declarations do happen in reality.)
It does happen a lot that old text content in non-UTF-8 encoding is mistakenly served explicitly marked as UTF-8. It is precisely in such circumstances that the Repair Text option is useful.
If you give me a computer timestamp without a timezone, I can and will assume it's in UTC. It might not be, but if it's not and I process it as though it is, and the sender doesn't like the results, that's on them. I'm willing to spend approximately zero effort trying to guess what nonstandard thing they're trying to send me unless they're paying me or my company a whole lot of money, in which case I'll convert it to UTC upon import and continue on from there.
Same with UTF-8. Life's too short for bothering with anything else today. I'll deal with some weird janky encoding for the right price, but the first thing I'd do is convert it to UTF-8. Damned if I'm going to complicate the innards of my code with special case code paths for non-UTF-8.
If there were some inherent issue with UTF-8 that made it significantly worse than some other encoding for a given task, I'd be sympathetic to that explanation and wouldn't be such a pain in the neck about this. For instance, if it were the case that it did a bad job of encoding Mandarin or Urdu or Xhosa or Persian, and the people who use those languages strongly preferred to use something else, I'd understand. However, I've never heard a viable explanation for not using UTF-8 other than legacy software support, and if you want to continue to use something ancient and weird, it's on you to adapt it to the rest of the world because they're definitely not going to adapt the world to you.
It depends on the domain. If you are writing calendar software, it is legitimate to have "floating time" i.e. your medication reminder is at 7pm every day, regardless of time zone, travel, or anything else.
Unfortunately Google and many other companies have decided UTC is the only way, so this sometimes causes issues with ICS files that use that format when Gmail generates its helpful popups in the inbox.
> If you are writing calendar software, it is legitimate to have "floating time" i.e. your medication reminder is at 7pm every day, regardless of time zone, travel, or anything else.
If you have to take medication (for instance, an antibiotic) every 24 hours, it must be taken at the same UTC hour, even if you took a train to a town in another timezone. Keeping the same local time even when the timezone changes would be wrong for that use case.
If you're there for a while, you'll need to adapt anyway since your biorhythms will too. But there are plenty of other cases like a reminder to check something after dinner, or my standard wake-up alarm in the morning. Or if I plan to travel, book lunch at a nice place for 1pm, and put it in my calendar I just want it to be 1pm wherever I go, without caring about TZ changes.
Calendars, alarms, and reminders have some overlap here and floating time can be good for some cases.
There are very few drugs where that's a requirement. Your kidneys and liver aren't smart enough to metabolize anything at precisely the same rate every day anyway.
> For instance, if it were the case that it did a bad job of encoding Mandarin
I don't know if you picked this example on purpose, but using UTF-8 to encode Chinese is 50% larger than the old encoding (GB2312). I remember people cared about this like twenty years ago. I don't know of anyone that still cares about this encoding inefficiency. Any compression algorithm is able to remove such encoding inefficiency while using negligible CPU to decompress.
That doesn't seem like the worst issue imaginable. I doubt there are too many cases where every byte counts, text uses a significant portion of the available space, and compression is unavailable or inefficient. If we were still cramming floppies full of text files destined for very slow computers, that'd be one thing. Web pages full of uncompressed text are still either so small that it's a moot point or so huge with JS, images, and fonts that the relative text size isn't that significant.
Which is all to say that you're right, but I can't imagine that it's more than a theoretical nuisance outside some extremely niche cases.
They shouldn't be non-existent. Zip-then-encrypt is not secure due to information leakage.
EDIT: also, it's not safe—message length is dependent on the values of the plaintext bytes, period. i'm not saying don't live dangerously, i'm just saying live dangerously knowing
The information leakage problem occurs when compression is done in the TLS layer, because then the compression context includes both headers (with cookies) and bodies (containing potentially attacker-controlled data). But if you do compression at the HTTP layer using its Transfer-Encoding then the compression context only covers the body, which is safe.
It can still leak data if attackers can get their input reflected. I.e. I send you a word, and then I get to observe a compressed and encrypted message including my word and the sensitive data. If my word matches the sensitive data, the ciphertext will be smaller. Hence I can learn things about the plaintext. That is no longer good encryption.
What you are talking about is generally referred to as the "BREACH" attack. While there may theoretically be scenarios where it is relevant, in practice it almost never is, so the industry has largely decided to ignore it (it's important to distinguish this from the CRIME attack, which is about HTTP headers instead of the response body and has a much higher likelihood of being exploitable, while still being hard).
The reason it's usually safe is that to exploit it you need:
- a secret inside the html file
- the secret has to stay constant and cannot change (since it is adaptive attack. CSRF tokens and similar things usually change on every request so cannot be attacked)
- the attacker has to have a method to inject something into the html file and repeat it for different payloads
- the attacker has to be able to see how many bytes the response is (or some other side channel)
- the attacker is not one of the ends of the communication (no point to attack yourself)
Having all these requirements met is very unlikely.
For Asian languages, UTF-8 is basically the same size as any other encoding when compressed[0] (and you should be using compression if you care about space) so in practice there is no data size advantage to using non-standard encodings.
In addition, Chinese characters encode more information than English letters, so a text written in Chinese will generally consume fewer bytes than the same text in English even when using UTF-8.
(Consider: Horse is five letters, but 馬 is one character. Even at three bytes per character, Chinese wins.)
Presumably that derives from the overhead of encoding an English character as a full byte? Given there are only 26 characters normally, you could fit that into 5 bits instead, which funnily enough roughly lines up with the Chinese character encoding (5 letters × 5 bits vs 1 character × 24 bits).
A key aspect is that nowadays we rarely encode pure text - while other encodings are more efficient for encoding pure Mandarin, nowadays a "Mandarin document" may be an HTML or JSON or XML file where less than half of the characters are from CJK codespace, and the rest come from all the formatting overhead which is in the 7-bit ASCII range, and UTF-8 works great for such combined content.
> For instance, if it were the case that it did a bad job of encoding Mandarin
Please look up the issues caused by Han unification in Unicode. It’s an important reason why the Chinese and Japanese encodings are still used in their respective territories.
I can't help myself. The grandest of nitpicks is coming your way. I'm sorry.
> If you give me a computer timestamp without a timezone, I can and will assume it's in UTC.
Do you mean, give you an _offset_? In `2024-04-29T14:03:06.0000-08:00` the `-08:00` is an offset. It only tells you what time this stamp occurred relative to UTC. It does not tell you anything about the region or zone itself. While I have consumed APIs that give me the timezone context as part of the response, none of them make it part of the timestamp itself.
The only time you should assume a timestamp is UTC is if it has the `Z` at the end (assuming ISO 8601) or is otherwise marked as UTC. Without that, you have absolutely no information about where or when the time occurred -- it is local time. And if your software assumes a local timestamp is UTC, then I argue the resulting breakage is your software's problem, not the sender's.
My desire to meet you at 4pm has no bearing on whether the DST switchover has happened, or whether my government decides to change the timezone rules, or {any other way the offset for a zone can change for future or past times}. My reminder to take my medicine at 7pm is not centered on UTC or my physical location on the planet. It's just at 7pm. Every day. If I go from New York to Paris then no, I do not want your software to tell me my medicine is actually supposed to be at midnight. It's 7pm.
But, assuming you aren't doing any future scheduling, calendar appointments, bookings, ticket sales, transportation departure, human-centric logs, or any of the other ways Local Time is incredibly useful -- ignore away.
As I mentioned in another reply, "remind me every day at 7PM" isn't a timestamp. It's a formula for how to determine when the next timestamp is going to occur. Even those examples are too narrow, because it's really closer to "remind me the next time you notice that it's after 7PM wherever I happen to be, including if that's when I cross a time zone and jump instantly from 6:30PM to 7:30PM".
Consider my statement more in the context of logs of past events. The only time you can reasonably assume a given file is in a particular non-UTC TZ is when it came from a person sitting in your same city, from data they collected manually, and you're confident that person isn't a time geek who uses UTC for everything. Otherwise there's no other sane default when lacking TZ/offset data. (I know they're not the same, but they're similar in the sense that they can let you convert timestamps from one TZ to another).
> As I mentioned in another reply, "remind me every day at 7PM" isn't a timestamp. It's a formula for how to determine when the next timestamp is going to occur. Even those examples are too narrow, because it's really closer to "remind me the next time you notice that it's after 7PM wherever I happen to be, including if that's when I cross a time zone and jump instantly from 6:30PM to 7:30PM".
That's certainly fair in the context of a recurring event with some formula. I caution that a lot of people will still immediately reach for timestamps to calculate that formula, particularly for the next occurrence, and in the context of this conversation they would be given as an ISO 8601 datetime based on Local Time. I would also caution that calendar events with a distinct moment in time at which they start are also prime candidates for Local Time, where a UTC-default mentality will cause errors.
> Consider my statement more in the context of logs of past events
From the stance of computer generated historical log data, I definitely agree that UTC everywhere is a sane default and safe to assume :)
(And, in your defense, I would definitely argue UTC-everywhere gets you 95% of the way there for 5% of the effort... I get why people make the tradeoff)
> (I know they're not the same, but they're similar in the sense that they can let you convert timestamps from one TZ to another).
More nitpicking on my part (again, I'm sorry): it lets you convert from one _offset_ to another, or from an offset to UTC. Think of Arizona, a special snowflake that (mostly!) doesn't observe DST. You can't assume all UTC-7 offsets are Mountain Time.
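To make the nitpick concrete, here's a small sketch assuming Python 3.9+'s zoneinfo module: in January the two zones share an offset, so the offset alone can't identify the zone, and by July they have diverged.

    from datetime import datetime
    from zoneinfo import ZoneInfo

    jan = datetime(2024, 1, 15, 12, 0)
    jul = datetime(2024, 7, 15, 12, 0)

    # January: Phoenix and Denver are both at UTC-7.
    print(jan.replace(tzinfo=ZoneInfo("America/Phoenix")).utcoffset())
    print(jan.replace(tzinfo=ZoneInfo("America/Denver")).utcoffset())

    # July: Denver moves to UTC-6 for DST, Phoenix (mostly) stays at UTC-7.
    print(jul.replace(tzinfo=ZoneInfo("America/Phoenix")).utcoffset())
    print(jul.replace(tzinfo=ZoneInfo("America/Denver")).utcoffset())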
It's always nice to see someone who actually understands time.
"Convert to UTC and then throw away the time zone" only works when you need to record a specific moment in time so it's crazy how often it's recommended as the universal solution. It really isn't that hard to store (datetime, zone) and now you're not throwing away information if you ever need to do date math.
Yeah, I've been trying to convince people forever to store time zones with timestamps when appropriate. If you record events from around the world and don't record what time zone they happened in you can't even answer basic questions like "what proportion happened before lunch time?"
People love simple rules and they will absolutely take things too far. Most developers learn "just use UTC!" and think that's the last thing they ever need to learn about time.
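As a sketch of why the stored zone matters (Python again, with made-up events): once you keep the zone next to the UTC instant, the "what proportion happened before lunch time?" question above becomes trivial.

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    # Hypothetical events: (UTC instant, IANA zone the event happened in).
    events = [
        (datetime(2024, 5, 1, 17, 30, tzinfo=timezone.utc), "America/New_York"),
        (datetime(2024, 5, 1, 1, 15, tzinfo=timezone.utc), "Asia/Tokyo"),
        (datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc), "Europe/Paris"),
    ]

    before_lunch = sum(
        1 for utc, zone in events
        if utc.astimezone(ZoneInfo(zone)).hour < 12   # local wall-clock hour
    )
    print(f"{before_lunch} of {len(events)} events happened before local noon")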
Developers should assume UTF-8 for text files going forward.
UTF-8 should have no BOM. It is the default. And there is no Byte Order that needs a Mark. Requiring a UTF-8 BOM just destroys the carefully planned property that ASCII-is-UTF8. Why spoil that good work?
Other Unicode encodings have BOMs, e.g. UTF-16 (where the BOM distinguishes BE from LE).
We know CJK languages need UTF-16 for compression.
The BOM is only a couple more bytes.
No problem, so far so good.
But there are old files, that are in 'platform encoding'.
Fine, let there be an OS 'locale', that has a default encoding.
That default can be overridden with another OS 'encoding' variable. And that can be overridden by an application arg. And there may be a final default for a specific application, that is only ever used with one language encoding. Then individual files can override all of the above ...
Text file (serialization) formats should define handling of an optional BOM followed by an ASCII header that declares the encoding of the body that follows. One can also imagine three-letter file extensions that imply a unique or default encoding, with values maintained by, say, IANA (like MIME types). XML got this right. JSON and CSV are woefully neglectful, almost to the point of criminal liability.
But in the absence of all of the above, the default-default-default-default-default is UTF-8.
We are talking about the future, not the past. Design for UTF-8 default, and BOMs for other Unicodes. Microsoft should have defined BOMs for Windows-Blah encodings, not for UTF-8!
When the whole world looks to the future, Microsoft will follow. Eventually. Reluctantly.
The specific use case the OP author was focusing on was CSV. (A format which has no place to signal the encoding inline). They noted that, to this day, Windows Excel will output CSV in Win-1252. (And the user doing the CSV export has probably never even heard of encodings).
If you assume UTF-8, you will have corrupted text.
I agree that I'm mad about Excel outputting Win-1252 CSV by default.
You are suggesting that if software developers willfully refuse to implement measures to detect CP-1252 in CSVs, instead insisting on assuming UTF-8 even though they know that will result in lots of corruption with data from the most popular producer of CSVs -- you are suggesting that this will put pressure on MS to make Excel output UTF-8 by default instead?
If the world worked that way, it would be a very different world than the one we've got.
If you think you can detect only CP-1252, I have news for you. It's a one-byte encoding; it can't fit much. So be prepared for a whole zoo of other one-byte encodings from that era (e.g. Cyrillic letters - welcome to CP-1251, where everything above the first 128 chars has a totally different meaning). Writing an encoding detector is not easy at all. The chance of guessing wrong is high. I'm glad most of the world (but not Excel) moved away from that can of worms.
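Which is why, in practice, the usual move is to lean on an existing detector rather than writing one. A minimal sketch with the chardet library (the filename is made up):

    # pip install chardet
    import chardet

    raw = open("legacy_export.csv", "rb").read()   # hypothetical input file
    guess = chardet.detect(raw)   # e.g. {'encoding': 'windows-1251', 'confidence': 0.87, ...}
    print(guess)
    text = raw.decode(guess["encoding"] or "utf-8", errors="replace")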
> You are suggesting that if software developers willfully refuse to implement
it will happen anyway, sooner or later
> will result in lots of corruption
there is lots of corruption anyway (e.g. the euro sign, some European letters, whole other alphabets like Cyrillic). The most dangerous cases are the subtle ones, like the German ẞ.
P.S. I don't have Office installed to check, but the online version exports non-CP-1252 chars as "?". So nice.
P.P.S. Apple Numbers offers a choice of encoding on export, with UTF-8 as the default. Google Sheets exports as UTF-8 with no choice.
OP is in fact about doing your best to detect encoding without proper tagging. It is not perfect and you won't have 100% accuracy. But it can get pretty decent (OP is literally about the techniques used to do so, statistically etc), and is necessary because of the actual world we live in. If just refusing to do it would get MS to export in UTF-8 by default that would of course have happened a long time ago!
Programming languages have lumbered slowly towards UTF-8 by default but from time to time you find an environment with a corrupted character encoding.
I worked at an AI company that was ahead of its time (actually I worked at a few of those) where the data scientists had a special talent for finding Docker images with strange configurations, so all the time I'd find that one particular container was running with a Hungarian or other wrong charset.
(And that's the problem with Docker... People give up on being in control of their environment and just throw in five different kitchen sinks and it works... Sometimes)
If csv files bring criminal liability then I am guilty.
Sidenote: this particular criminal conspiracy is open to potential new members. Please join the Committee To Keep Csv Evil: https://discord.gg/uqu4BkNP5G
Jokes aside, talking about the future is grand but the problem is that data was written in the past and we need to read it in the present. That means that you do have to detect encoding and you can't just decide that the world runs on UTF-8. Even today, mainland China does not use UTF-8 and is not, as far as I know, in the process of switching.
I understand UTF-8 is mostly fine even for East Asian languages though - and bytes are cheap.
There is no authoritative CSV specification. That does bring opprobrium. RFC 4180 is from 2005, long after Unicode and XML, so people should have known better. The absence of a real standard points to disagreement, perhaps commercial disagreement, but the IETF is supposed to be independent, is it not?
That failure to standardize encoding, and other issues (headers, etc.), has wasted an enormous amount of time for many creative hackers, who could have been producing value rather than banging their heads against a vague assembly of pseudo-spec-illusions. Me included, sadly.
You're blaming the lack of a spec for all the wasted time but that's not the cause.
The cause is that CSV is popular and is popular because it is incompletely defined. (See also: HTML, RSS.)
Making a CSV spec post-hoc solves nothing, as others here have pointed out, because there is already an "installed base" so to speak. If anything it's worse because it might mislead some people into thinking they can write to the spec and handle any CSV file.
The right move, if you want a nicely precise and strict spec, is to admit it's a new thing and give it a new name, maybe CSVS or something like that.
But good luck making it popular - there are plenty of CSV libraries out there that handle most CSV files well enough, just as there is tons of software that handles HTML and RSS well enough (which is why I am a fan of both those formats).
The only argument I'll present for giving over CSV to those who want to flail at the idea of standardizing it: the name implies too much standardization already. Why "C"SV? Most CSV processing tools accept a delimiter, right? They are just whatever-separated values; use semicolons or tabs if you want.
I'm sorry, quite right - you did. Using the computer locale.
The trouble is that people transmit data from one computer to another and so from one locale to another. And sadly, they do not always set the character encoding header correctly, if they even know what that is.
I mean take csvbase's case. It has to accept csv files from anyone. And christ preserve me, they aren't going to label them as "Win-1252" or "UTF-16" or whatever.
There is no alternative but statistical detection. And there is good evidence that this solution is fairly satisfactory, because millions of computers are using it right now! At this point csvbase uploads run into more problems with trailing commas than with character detection getting it wrong - that is your "Schelling point", I'm afraid!
HTTP actually does quite a good job of providing headers containing MIME type and encoding. There is a little work to get the default (e.g. HTML and XML are different), and decide on the case where the XML payload encoding is different to the HTTP transport encoding (e.g. perhaps XML parsers need a way to override the embedded header).
So we end up at another plausible future-directed design decision: computer-computer communication should use HTTP. I think many systems have ended up there already, perhaps prompted by the issues we have discussed.
Moral: good specs attract usage; bad, incomprehensible or inconsistently implemented specs fade away.
> China does not use UTF-8 and is not, as far as I know, in the process of switching.
That’s … not true? Most Chinese software and websites are UTF-8 by default, and it’s been that way for a while. GBK and its sisters might linger in legacy formats, but UTF-8 has certainly reached the point of mass adoption from what I can tell.
> We know CJK languages need UTF-16 for compression.
My understanding is that it is for the opposite of compression: it saves memory when uncompressed, versus UTF-8's multi-byte sequences needing more bytes. My understanding is that UTF-8 multi-byte sequences compress pretty well, as they have common patterns that form dictionary "words" just as easily as anything else. UTF-8 seems to be winning in the long run even for CJK and astral-plane languages on disk, and the operating systems and applications that preferred UTF-16 in memory are mostly only doing so out of backwards compatibility, and are themselves often using more UTF-8 buffers internally since those reflect the files at rest.
(.NET keeps UTF-16 code unit strings by default for backwards compatibility, but has more and more UTF-8-only pathways and now has some interesting compile-time options to go UTF-8-only. Python 3 settled on UTF-8 as its default text encoding, even with input from CJK communities. UTF-8 really does seem to be slowly winning everything.)
> JSON and CSV are woefully neglectful,
As the article also points out, JSON probably got it right: UTF-8 only and BOM is an error (because UTF-8) (but parsers are allowed to gently ignore that error if they wish). https://www.rfc-editor.org/rfc/rfc8259#section-8.1
That seems to be the way forward for new text-based formats that only care about backward compatibility with low-byte ASCII: UTF-8 only, no BOM. And UTF-8 (unlike UTF-16, with its surrogate reservations) could be extended further if we ever do find a reason to go past the "astral planes".
(Anyone still working in CSV by choice is maybe guilty of criminal liability though. I still think the best thing Excel could do to help murder CSV is give us a file extension to force Excel to open a JSON file, like .XLJSON. Every time I've resorted to CSV has been because "the user needs to double click the file and open in Excel". Excel has great JSON support, it just won't let you double click a file for it, which is the only problem, because no business executive wants the training on "Data > From JSON" no matter how prominent in the ribbon that tool is.)
> When the whole world looks to the future, Microsoft will follow.
That ship is turning slowly. Windows backward compatibility guarantees likely mean that Windows will always have some UTF-16, but the terminals in Windows now correctly default to UTF-8 (since Windows 10) and even .NET with its compatibility decrees is more "UTF-8 native" than ever (especially when compiling for running on Linux, which is several layers of surprise for anyone that was around in the era where Microsoft picked UCS-2 as its one true format in the first place).
You can fit Japanese comfortably in a 16 bit charset but Chinese needs more than that.
My take though is that CSV is not a good thing because the format isn't completely standardized: you just can't count on people having done the right thing with escaping, or on knowing whether a particular column is intended to be handled as strings or numeric values, etc.
Where I work we publish data files in various formats, I'm planning on making a Jupyter notebook to show people how to process our data with Pandas -- one critical point is that I'm going to use one of the commercial statistics data formats (like Stata) because I can load the data right the first time and not look back. (e.g. CSV = good because it is "open" is wrong)
If I am exporting files for Excel users I export an Excel file. Good Excel output libraries have been around for at least 20 years and even if you don't have fun with formulas, formatting and all that it is easy to make a file that people will load right the first time and every time.
> Good Excel output libraries have been around for at least 20 years
I wish that were the case more often. Depends on your ecosystem, of course.
For instance, I've yet to find a good XLSX library for JS that works well "isomorphically" (in the browser as well as Node/etc). Every one I tried either had native dependencies and couldn't run in-browser, or had a cost (time, money, size) I couldn't budget for at the time.
I have found some XLS libraries for JS that were extremely "mid", but outputting XLS is nearly as bad as CSV in 2024. (Including all the huge messy legacy of character set Encoding problems.)
The best and worst thing about CSV is that it seems "low overhead": it seems really cheap to output. ("How hard can it be, just ','.join(records)?" has been the pit so many of us fall into over and over again and sometimes never climb out of.) In terms of low overhead: in a world where all my APIs are already talking JSON, if I could wrap an existing HTTP API with just two extra headers to get "free" Excel files for my users, that could be a beautiful world:
All the pieces are already there. If you could teach every user to use "Data > From JSON" we could maybe have nice things today instead of yet another CSV export dump. We just need someone on the Excel team to greenlight a "double to click to open an .XLJSON file" feature.
First and foremost to avoid accidents: We don't want a misconfigured HTTP website accidentally opening a new Excel window for every fetch/XHR call or to have to fight Excel defaults to get JSON to open up in our IDE or Dev Tools of choice. We don't want random shell scripts accidentally curl-ing things to Excel. Things like that.
Secondly for "ownership" reasons: We don't want to give non-developers the mistaken impression that Excel "owns" JSON and that it is a Microsoft format. I've had people tell me that CSV must be a Microsoft format because you can double click them in Excel and they show an Excel-like icon (in some ways it has been too long since Lotus existed and Excel was in the "we'll take all of our competitors' file associations too" era). On the one hand it might be nice to blame all of CSV's problems on Microsoft and Excel if that were actually the case, but on the other hand it also confuses people as to the real/valid uses of the format. Unfortunately, too, that transitive relationship goes both ways and I've heard second hand that CSV files are among the reasons the Excel team hopes to never add another file type association again because supposedly they get far too many support requests for CSV file problems that maybe shouldn't be their job to deal with.
A separate file association adds some intent of "this file was actually meant to be opened in Excel and hopefully the developer actually tested it some".
What you do, rather, is drop support for non-UTF-8.
Work with tech-stacks whose text handling is based strictly around Unicode and UTF-8, and find enough opportunities that way that you don't have to care about anything else.
Let the customers who cling to data in weird encodings go to someone who makes it their niche to support that.
Joel Spolsky spoke against this exact statistics-based approach when he wrote about Unicode[1]:
> What do web browsers do if they don’t find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It’s truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn’t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it’s Korean and displays it thusly, proving, I think, the point that Postel’s Law about being “conservative in what you emit and liberal in what you accept” is quite frankly not a good engineering principle.
I don't think he was speaking against the statistics-based approach itself, just against Postel's Law in general.
Ideally people would see gibberish (or an error message) immediately if they don't provide an encoding; then they'll know something is wrong, figure it out, fix it, and never have the issue again.
But if we're in a situation where we already have lots and lots of text documents that don't have an encoding specified, and we believe it's not feasible to require everyone to fix that, then it's actually pretty amazing that we can often correctly guess the encoding.
There's the enca library (and CLI tool) which does exactly that. I used it often before UTF-8 became overwhelming. The situation was especially dire with Russian encodings. There were three one-byte encodings which were quite widespread: KOI8-R, mostly found in unixes; CP866, used in DOS; and CP1251, used in Windows. What's worse, with Windows you sometimes had to deal with both CP866 and CP1251, because it includes a DOS subsystem with a separate codepage.
Exactly. I used this technique at Mozilla in 2010 when processing Firefox add-ons, and it misidentified scripts as having the wrong encoding pretty frequently. There's far less weird encoding out there than there are false positives from statistics-based approaches.
If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding. But that should be a fallback, not the first thing you try.
> If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding.
Don't really even need to do that. There's only a handful other encodings still in common use, just try each of them as fallbacks and see which one works without errors, and you'll manage the vast majority of what's not UTF-8.
(We recently did just that for a system that handles unreliable input, I think I remember our fallback only has 3 additional encodings before it gives up and it's been working fine)
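A minimal sketch of that approach in Python; the particular fallback list here is just an example and should mirror whatever legacy encodings your inputs actually arrive in:

    def decode_lenient(data: bytes) -> str:
        # Strict UTF-8 first: it almost never false-positives on other encodings.
        for encoding in ("utf-8", "cp1252", "shift_jis", "koi8-r"):
            try:
                return data.decode(encoding)
            except UnicodeDecodeError:
                continue
        # Last resort: don't crash, but make the damage visible.
        return data.decode("utf-8", errors="replace")

(As the replies below note, the single-byte encodings in such a list rarely fail outright, so the ordering matters and a wrong-but-decodable guess is still possible.)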
The person you're replying to sort of addresses this, though not completely.
Since UTF-8 is a variable-length encoding, it somewhat naturally has some error detection built in. Fixed-length encodings don't really have that, and for some of them, any byte value, 0 to 255, in any position, is valid. (Some have a few byte values that are invalid or reserved, but the point still stands.)
So you could very easily pick a "next most common encoding" after UTF-8 fails, try it, find that it works (that is, no bytes are invalid in that encoding), but it turns out that's still not actually the correct encoding. The statistics-based approach will nearly always yield better results. Even a statistics-based approach that restricts you to a few possible encodings that you know are most likely will do better.
These are actually interpreted as the corresponding C1 control codes by Windows, so arguably not invalid in practice, just formally reserved to be reassigned to other characters in the future.
> Many charsets decode fine as UTF-8 as long as the message happens to fit in ASCII.
At which point the message is effectively ASCII. UTF-8 is a superset of ASCII, so "decoding" ASCII as UTF-8 is fine.
(Yes, I know there are some Japanese text encodings where 0x5c is decoded as "¥" instead of "\". But they're sometimes treated as backslashes even though they look like ¥ symbols so handling them "correctly" is complicated.)
"Fun" fact: some video subtitle formats (ASS specifically) use "text{\b1}bold" to format things — but since they were primarily used to subtitle Japanese anime, this frequently became "text{¥b1}bold". Which is all good and well, except when those subtitles moved to UTF-8 they kept the ¥. So now you have to support ¥ (0xC2 0xA5) as a markup/control character in those subtitles.
I'm guessing they're thinking of Extended ASCII (the 8-bit one that's actually multiple different encodings, but the lower half is shared with 7-bit ASCII and so that part does fit in UTF-8 while the upper half likely won't if the message actually uses it).
The definition GP is using most likely refers to non-ASCII sequences that validly decode as UTF-8, because virtually every major charset in practical use has ASCII as a subset.
ASCII (or ISO646) has local variants [0] that replace certain characters, like $ with £, or \ with ¥. The latter is still in use in Japan. That’s why "ASCII" is sometimes clarified as "US-ASCII".
I'm skeptical. Any charset that uses bytes 128-255 as characters is unlikely to successfully decode as UTF-8. Are there really many others that only use 0-127, or does most text just end up only using 0-127?
I think there are a bunch of encodings that just repurposed a few ASCII characters as different characters - someone on this page was giving the example of some Swedish encoding where {}| were replaced with three accented Swedish letters. There are probably a bunch of others. In those cases, the text will decode fine as UTF-8, but it will display the wrong thing.
I meant that some messages will decode fine as UTF-8, but in other messages there may be letters which don't fit in 7 bits. So some simple testing, especially with English words, will show it to work fine. But as soon as a non-7-bit character creeps in, it will stop working fine.
It's garbage anyway, which you can (non-reliably) guess by there being a Korean character in the middle of C/J kanji. (Kanji are not completely gone from Korean, but mostly.)
Based on my past role, you can't even assume UTF-8 when the file says it's UTF-8.
Clients would constantly send CSV or other files with an explicit BOM or other marking indicating UTF-8 but the parser would choke since they just output native Windows-1252 or similar into it. I think some programs just spit it out since it's standard.
Better to assume UTF-8 and fail with a clear message/warning. Sure, you can offer to guess to help the end user if it fails, but as other people have pointed out, it's been the standard for a long time now. Even Python caved and accepted it as the default: https://peps.python.org/pep-0686/
Off-topic, but the bit numbering convention is deliciously confusing.
Little-endian bytes (lowest byte is leftmost) and big-endian bits (bits contributing less numerical value are rightmost) are normal, but the bits are referenced/numbered little-endian (first bit is leftmost even though it contributes the most numerical value). When I first read the numbering convention I thought it was going to be a breath of fresh air of someone using the much more sane, but non-standard, little-endian bits with little-endian bytes, but it was actually another layered twist. Hopefully someday English can write numbers little-endian, which is objectively superior, and do away with this whole mess.
It actually would be, if we did not need to consider historical baggage.
Especially in programming, where we already use in-band prefixes like 0x to denote a hex string or 0b to denote a binary string. I like using 1{s} for little-endian encodings, e.g. 1x for a little-endian hex string and 1b for a little-endian binary string.
But even ignoring programming, it is still better in normal use. The Arabic language got it right by writing little-endian (Arabic numerals are written the same, but Arabic is a right-to-left language, so they are actually little-endian), and the European languages just stole it stupidly, copying the form instead of the function.
From what I have seen, low numbers in Arabic are spoken/written little-endian (twenty five is five and twenty). Apparently German as well. The internet claims that historically large numbers in Arabic were also written out (as in when using number words rather than numerals) little-endian.
"Although generally found in text written with the Arabic abjad ("alphabet"), numbers written with these numerals also place the most-significant digit to the left, so they read from left to right (though digits are not always said in order from most to least significant[10]). The requisite changes in reading direction are found in text that mixes left-to-right writing systems with right-to-left systems."
Default UTF-8 is better than the linked suggestion of using a heuristic, but failing catastrophically when old data is encountered is unacceptable. There must be a fallback.
(Note that the heuristic for "is this intended to be UTF-8" is pretty reliable, but most other encoding-detection heuristics are very bad quality)
You can't just assume UTF-8, but you can verify that it is almost surely encoded in UTF-8 unlike other legacy encodings. Which makes UTF-8 the first and foremost consideration.
If it's turtles all the way down and at every level you use UTF-8, it's hard to see how any input with a different encoding (for the same underlying text) would not be detected before any unintended side effects are invoked.
At this point, I don't see any sufficiently good reason to not use utf-8 exclusively in any new system. Conversions to and from other encodings would only be done at well defined boundaries when I'm calling into dependencies that require non utf-8 input for whatever reason.
Thanks for giving me an example of an architecture where the bits are labelled backwards, I'd never encountered that before. I've always appreciated that the bit number represents 2 to the power of that number.
Basically everyone uses x86 bit numbering. It has the pleasant property that the place value of every bit is always 2^n (or -2^n for a sign bit), and zero-extending a value doesn't change the numbering of its bits.
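A tiny illustration of that property, in Python: under x86-style numbering, bit n always contributes 2**n, so widening the value doesn't renumber anything.

    def bit(value: int, n: int) -> int:
        # Bit n has place value 2**n; bit 0 is the least significant bit.
        return (value >> n) & 1

    x = 0b1010_0001
    print(bit(x, 0), bit(x, 5), bit(x, 7))   # 1 1 1
    print(bit(x, 1), bit(x, 6))              # 0 0
    # Zero-extending x to 16 or 32 bits changes none of these answers.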
The more I thought it through, even assuming x86, I guess there’s just no “correct” way to casually reference bit positions when we read them in the opposite order from the machine. Are they being referenced from the perspective of a human consumer of text, or the machine’s perspective as a consumer of bits? If I were writing that content, I’d have a difficult time deciding on which to use. If I were writing for a lay person, referencing left-to-right seems obvious, but in this case where the audience is primarily developers, it becomes much less obvious.
The probability of web content not in UTF-8 is increasingly getting lower and lower.
Last I tracked, as of this month, 0.3% of surveyed web pages used Shift JIS. It has been declining steadily. I really hope people move to UTF-8. While it is important to understand how the code pages and encodings helped, I think it's a good time to actually start moving a lot of applications to use UTF-8. I am perfectly okay if people want to use UTF-16 (the OG Unicode) and its extensions as an alternative, especially for Asian applications.
Yes, historic data preservation requires a different strategy than designing stuff for the future. It is okay to however migrate to these encodings and keep giving old data and software new life.
Just the most recent episode: A statistician is using PHP, on Windows, to analyze text for character frequency. He's rather confused by the UTF-16LE encoding and thinks the character "A" is numbered 4100 because that's what is shown in a hex-editor. I tried explaining about the little-endian part, and mb-string functions in PHP. And that PHP is not a good fit for his projects.
Then I realized that this is hilarious and I won't be able to kick him from his local minimum there. Everything he could learn about encodings would first complicate his work.
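For anyone else puzzled by that one: the hex editor was simply showing the two UTF-16LE bytes of "A" in storage order. A quick interpreter check (Python here, but any language would do) makes it obvious:

    >>> "A".encode("utf-16-le").hex()   # low byte stored first
    '4100'
    >>> "A".encode("utf-16-be").hex()   # high byte stored first
    '0041'
    >>> hex(ord("A"))
    '0x41'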
The post seems to assume that only UTF-16 has Byte Order Marks, but as pointless as it sounds, UTF-8 has a BOM too (EF BB BF). It seems to be a Windows thing though; I haven't seen it in the wild anywhere else (and only rarely on Windows, since text editors typically allow saving UTF-8 files with or without a BOM. I guess it depends on the text editor which of those is the default).
That's not really a byte order mark though, it's just the UTF-8 encoding of U+FEFF, which corresponds to the byte order mark in UTF-16. Honestly, emitting that into UTF-8 was probably the result of a bug originally, caused by Windows Unicode APIs being designed for UTF-16.
Yes you're right, UTF-8 technically does as well. I've never seen them in real life either.
UTF-16 BOMs do have a useful function as I recall: they really help Excel detect your character encoding (Excel is awful at detecting character encoding).
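Sniffing for these marks is only a few lines; here's a sketch (not a full detector - note the UTF-32LE BOM starts with the same bytes as the UTF-16LE one, so the longer marks have to be checked first):

    def sniff_bom(data: bytes):
        """Return (encoding, bom_length) if a known BOM is present, else (None, 0)."""
        boms = [
            ("utf-32-le", b"\xff\xfe\x00\x00"),
            ("utf-32-be", b"\x00\x00\xfe\xff"),
            ("utf-8-sig", b"\xef\xbb\xbf"),
            ("utf-16-le", b"\xff\xfe"),
            ("utf-16-be", b"\xfe\xff"),
        ]
        for encoding, bom in boms:
            if data.startswith(bom):
                return encoding, len(bom)
        return None, 0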
Anyone got EBCDIC on their bingo cards? Because if the argument is "legacy encodings are still relevant in 2024" then we also need to bring EBCDIC (and EBCDIK and UTF-EBCDIC for more perverted fun) into the picture. Makes heuristics extra fun.
Or, you know, just say "nah, I can; that ancient stuff doesn't matter (outside of obligatory exceptions, like software archeology) anymore." If someone wants to feed me a KOI8-R or JIS X 0201 CSV heirloom, they should convert it into something modern first.
> Anyone got EBCDIC on their bingo cards? Because if the argument is "legacy encodings are still relevant in 2024"
I have a hobby interest in IBM mainframes and IBM i, so yes to EBCDIC for me. (I have encountered them professionally too, but only to a very limited extent.) In practice, I find looking for 0x40 (EBCDIC space) a useful heuristic. Even in binary files, since many mainframe data structures are fixed length space padded.
> then we also need to bring EBCDIC (and EBCDIK and UTF-EBCDIC for more perverted fun) into the picture. Makes heuristics extra fun.
Actual use of UTF-EBCDIC, while not nonexistent, has always been extremely rare. A person could spend an entire career dedicated to IBM mainframes and never encounter it.
EBCDIK, at first I wondered if that was a joke, now I realise it is a name used for non-IBM Japanese EBCDIC code pages. Again, something one can spend a whole career in mainframes and never encounter – if one never works in Japan, if one works for a vendor whose products aren't sold in Japan, probably even if you work for a vendor whose products are sold in Japan but only to IBM sites (as opposed to Fujitsu/Hitachi sites)
Actually, I can just assume UTF-8, since that's what the world standardized on. Just like I can assume the length of a meter or the weight of a gram. There is no need to have dozens of incompatible systems.
> would you guess the unit instead of specifying what you expect?
It depends on the circumstances. It might be the least bad thing to do. Or not.
But that wasn't my point. I replied to this:
> I can assume the length of a meter or the weight of a gram
Sure, the length of a meter and the "weight" of a gram are both standardized. (To be very picky, "gram" is a mass, not a weight. The actual weight depends on the "g" constant, which on average is 9.81 m/s^2 on earth, but can vary about 0.5%.)
So if you know the input is in meters, you don't need to do any further processing.
But dealing with input text files with an unknown encoding is like dealing with input lengths with an unknown unit.
So while UTF-8 itself might be standardized, it is not the same as all input text files always being in UTF-8.
You can choose to say that all input text files must be in valid UTF-8, or the program refuses to load them. Or you can use silent heuristics. Or something inbetween.
I don't understand the outraged tone. Asking developers to write actually good software shouldn't be viewed as some kind of crazy imposition. We don't have to write everything like it is running on a spacecraft (which I never claimed), but we should try to make it actually good. For example, if there was a web browser with a security compromise and the makers left it unfixed for a long time, there would be consequences. Saying "well, it's just a browser" wouldn't cut it.
More to the point, what situation can you think of where guessing measurement units is a good idea? In a CNC machine? Maps program? Somewhere else? You seem to have omitted the actual counterargument part from your counterargument, while adding a hearty dash of misplaced outrage.
You can start by assuming UTF-8, then move on to other heuristics if UTF-8 decoding fails. UTF-8 is "picky" about the sequence of bytes in multi-byte sequences; it's extraordinarily unlikely that text in any other encoding will satisfy its requirements.
(Other than pure ASCII, of course. But "decoding" ASCII text as UTF-8 is safe anyway, so that hardly matters.)
I don't think that's true. Looking at how it's encoded[0], it seems similar to many other country/language-specific encodings: bytes 0-127 are the control chars, Latin alphabet and symbols, and are more-or-less ASCII, while bytes 128-255 carry the characters specific to the language at hand.
The only way you'd successfully decode Shift-JIS as UTF-8 is if it essentially is just Latin-alphabet text (though the yen symbol would incorrectly display as a '\'). If it includes any non-trivial amount of Japanese, it'll fail to decode as UTF-8.
As for whether or not you can then (after it fails to decode as UTF-8) use statistical analysis to reliably figure out that it's in fact Shift-JIS, and not something else, I can't speak to that.
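A concrete check of the first part, using Python; 日本語 is just an arbitrary sample string:

    >>> data = "日本語".encode("shift_jis")
    >>> data.hex()
    '93fa967b8cea'
    >>> data.decode("utf-8")   # lead bytes like 0x93 are not valid UTF-8 start bytes
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte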
Do you have an example in mind? Looking at the Shift-JIS encoding tables, that seems unlikely to happen in a text of any nontrivial length; there's a small number of Shift-JIS sequences which would be valid as UTF-8, and any meaningfully long text is likely to stray outside that set.
I don't think it's fair to require "meaningfully long text" since when you're dealing with strings in programming they can often be of any arbitrary length.
Encoding detection is usually applied to a larger document, at the point it's ingested into an application. If you're applying it to short strings, something's not right -- where did those strings come from?
Taking an ID3 tag example, if you are mass-converting/sanitizing/etc. tag titles and other similar metadata, the strings are often very short, sometimes only even a single codepoint or character, and proper assumptions of encoding can not be relied on because so many people violate specs and put whatever they want in there, which is the whole point of wanting to sanitize the info in the first place.
I in fact can assume it. If the assumption is wrong then that's someone else's problem. 15 years ago I wrote a bunch of code using uchardet to detect encodings and it was a pretty useful feature at the time. In the last decade everything I've touched has required UTF-8 unless it's been interoperating with a specific legacy system which has some other fixed charset, and it's never been an issue.
There’s a difference between assuming and not making a distinction.
Very few developers I’ve met would know how to make that distinction. They’d see a few off characters and think it’s some one-off bug, but it’s because they’re both assuming an encoding.
Even if you said you’d pay them one billion dollars to fix it, they’d absolutely be unable to.
> Even if you said you’d pay them one billion dollars to fix it, they’d absolutely be unable to.
Unless you want it fixed immediately, then a million dollars should motivate almost any developer to spend a month learning, a month doing, and a few years on vacation. A billion is incomprehensible.
The way you get that reality is you do the opposite of the recommendation of Postel’s law: be very picky about what you consume and fail loudly if it’s not UTF-8
I haven't seen discussion of this point yet, but the post completely fails to provide any data to back up its assertion that charset detection heuristics work, because the feedback I've seen from people who actually work with charsets is that it largely doesn't work (especially if it's based on naive one-byte frequency analysis). Okay, sure, it works if you want to distinguish between KOI8-R and Windows-1252, but what about Windows-1252 and Windows-1257?
I've done some charset detection, although it's been a while. Heuristics kind of work for some things --- I'm a big fan of: if it's decodable as UTF-8, it's probably UTF-8, unless there are zero bytes (in most text). If there are a lot of zero bytes, maybe it's UCS-2 or UTF-16, and you can try to figure out the byte order and see if it decodes as UTF-16.
If it doesn't fit in those categories, you've got a much harder guessing game. But usually you can't actually ask the source what it is, because they probably don't know and might not understand the question or might not be contactable. Usually, you have to guess something, so you may as well take someone else's work to guess, if you don't have better information.
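That heuristic is short enough to sketch (Python; the ordering follows the comment above, and the NUL-byte threshold is an arbitrary pick, not a tuned value):

    def rough_guess(data: bytes) -> str:
        # Lots of NUL bytes strongly suggests a 16-bit encoding.
        if data.count(b"\x00") > len(data) // 4:
            for enc in ("utf-16-le", "utf-16-be"):
                try:
                    data.decode(enc)
                    return enc
                except UnicodeDecodeError:
                    continue
        # Otherwise: if it decodes as UTF-8, it's almost certainly UTF-8.
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "unknown"   # hand off to a statistical guesser from here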
Yeah. The fantastic python library ftfy ("fixes text for you", https://ftfy.readthedocs.io/en/latest/index.html), designed to fix mangled Unicode (mojibake, of many different varieties), mentions in its docs that heuristic encoding guessers are the cause of many of the problems ftfy is designed to fix. It's magical, by the way.
That section explains why not to use a specific naive charset detection library that doesn't have a strong prior for UTF-8. There's no basis for extrapolating that further.
I require UTF-8. If it isn't currently UTF-8, it's someone else's problem to transform it to UTF-8 first. If they haven't, and I get non-UTF-8 input, I'm fine bailing on that with a "malformed input - please correct" error.
That works until you can't pay your bills unless you take a new contract where you have to deal with a large amount of historical text files from various sources.
Even then, at that point I'm writing an adapter to convert into UTF-8 JIT in front of my service. Or modifying my program when that contract comes up, but I'm not going to waste time proactively safeguarding against a what-if that may come up at some point.
> I'm not going to waste time proactively safeguarding against a what-if
The article doesn't say that you should. It clearly states that for many cases, the input format is known or explicitly stated in the input headers.
The article talks about cases where the input files are in an unknown input format. Even then, it states: "Perhaps there is a case to be made that csvbase's auto-detection should be a more explicit user interface action than a pre-selected combo-box."
But for the case where the requirements call for heuristics, the article then talks about how that can be done.
> at that point I'm writing an adapter to convert into UTF-8 JIT in front of my service.
And at that point you might need the advice in the article, right?
> at that point I'm writing an adapter to convert into UTF-8 JIT in front of my service
Right, and at that point you're probably gonna need these statistics-based heuristics to write that adapter. Unless you know specifically what other encoding each bit of input is. If you do, then, again, you are not the target audience for this article.
If the Unicode consortium haven't been able to come up with a way of encoding their name correctly, I don't see what hope I have of doing so.
Bonus - as soon as the Unicode consortium do find a way, my software should be able to handle it with no further changes. Well, it might need a recompile against a newer `libicu` as I don't think they maintain ABI backcompat between versions. But there's not much I can do about that.
Unicode can't do it without a breaking change. But if you support non-unicode encodings in the traditional, documented and standard way then your application will handle it fine. Who knows, one day a successor to unicode may come out that can handle all languages properly within a single encoding, in which case an application written to be encoding-aware will support it without even a recompile.
What a bad, hyperbolic take. UTF-8 can encode the entire Unicode space. All you need is up-to-date libraries and fonts to display the codepoints correctly. It is backwards compatible forever. So requiring UTF-8 allows Japanese to represent their writing method exactly how it is and keep the scheme for a very long time with room to improve.
Japanese text in UTF-8 is frequently rendered with the Chinese version of the kanji due to Han unification, not "representing their writing method exactly". The Shift-JIS encoding communicates that the text is in Japanese via the encoding itself, facilitating selection of the correct font.
And it indeed facilitates that: in practice it works better than encoding in UTF-8, which lacks an in-band way to communicate the language, and out-of-band signaling often fails, is ignored, or doesn't exist.
I get that you're referring to Han Unification, but if software doesn't display unified glyphs with the correct style, that's an issue with the font rendering system, not Unicode. Sure, the font rendering system's job may have been easier had Unicode made different choices, but encoding-wise, Unicode is no more ambiguous than any other encoding it's round trip compatible with. The font rendering system is free to assume all unified glyphs should be rendered in a Japanese style, just like it would have with a Japanese-centric encoding.
> Switching your entire encoding system to set a different font is by far the stupidest way to do it.
If it's stupid and it works, it's not stupid. I wish there were other reliable ways to have international programs display Japanese correctly, but there aren't.
To be clear, we're talking about a program/library that can handle both unicode and shift-JIS, and it will render a character that unicode considers identical in different ways depending on what encoding you loaded the character from, right?
Firefox (not the best example for a number of reasons, if you're going to follow up then I'll talk about a different one, but if you really want just one then it's the program I have to hand right now).
> To be clear, we're talking about a program/library that can handle both unicode and shift-JIS, and it will render a character that unicode considers identical in different ways depending on what encoding you loaded the character from, right?
Though if you're making any attempt to use valid tags you'll have <html lang="ja">, and that solves the problem for the web context at least as far as my testing goes.
> Though if you're making any attempt to use valid tags you'll have <html lang="ja">
Right, which is why that's a bad example. But it's the same for most "normal" applications (even e.g. text editors, or I remember hitting this kind of thing in WinRAR), and a lot of the time there isn't a standard way of indicating the language/locale in the file. Even within firefox, there are (admittedly rare) cases where you're viewing something that isn't HTML and doesn't have HTTP headers so using the encoding or manually setting the page language is still the only way to make it work - and applications that have a manual "file language" setting are the exception rather than the rule.
My understanding is Unicode (and therefore UTF-8) can encode all the codepoints encodable by Shift JIS. I know that you need a language context to properly display the codepoints that have been Han Unified, so that could lead to display problems. But if we're trying to properly display a Japanese name, it's probably easier to put the appropriate language context in a UTF-8 document than it is to embed Shift JIS text into a UTF-8 document.
Realistically --- if someone hands me well marked Shift JIS content, I'm just going to reencode it as UTF-8 anyway... And if they hand me unmarked Shift JIS content, I'll try to see if I can decode it as UTF-8 and throw it away as invalid if not.
> My understanding is Unicode (and therefore UTF-8) can encode all the codepoints encodable by Shift JIS
Trivia: There are some variants of ShiftJIS where this isn't entirely true. The traditional yuugen gaisha symbol, for example, which is analogous to U+32CF (LTD), is not supported. The /VENDORS/APPLE/JAPANESE.TXT file uses a PUA designator and then a sequence of four Unicode code points to convert it.
> if we're trying to properly display a Japanese name, it's probably easier to put the appropriate language context in a UTF-8 document than it is to embed Shift JIS text into a UTF-8 document.
You'd think that, but in practice I've found the opposite. Applications that use encodings managed to display things properly. Applications that hardcode UTF-8 don't.
I'm no encoding geek, but I'd say I have more than a passing familiarity with the issues involved. But I didn't know UTF-8 had in-band language signaling until today, so it perhaps doesn't surprise me that many applications don't implement it. (UI toolkits should, though... there's kinda no excuse for that.)
If you implement any kind of encoding support (that is, any kind of support for non-ASCII/non-unicode) you will probably have working Shift-JIS support even if you never test it, because Shift-JIS works the same as every other encoding you might test with. If you tested French or Spanish or really anything that wasn't English, you will display Japanese fine.
If you implement only unicode then you put yourself in a situation where Japanese is uniquely different from every other language, and your program will not work properly for Japanese unless you tested Japanese specifically.
This wouldn't solve your original problem, since UTF-8 is also popular with Japanese users, so just adding Shift-JIS isn't enough. It comes down to the same basic thing: to support Japanese you have to do some extra work to get and use extra info about the language, which is also possible within UTF-8, counter to your initial broad claim of the opposite.
> This wouldn't solve your original problem since UTF8 also popular with Japanese
Only among people who don't care. Implementing encoding support means your app supports a way to display Japanese properly. If you want to add more ways to display Japanese properly, go ahead, but that's supererogatory, whereas UTF8-only apps don't have a way to display Japanese properly at all.
> to support Japanese, you have to do some extra work to get and use extra info about a language, which is also possible within UTF8
It's not. There's some imaginary theorycrafted way in which it might notionally be possible within UTF-8, but not one UTF-8-only app has ever actually implemented support for displaying Japanese properly. The only approach to displaying Japanese properly that has ever actually been implemented in reality is to support multiple encodings (or to support only a Japanese encoding, but that has obvious downsides), and if you make your app encoding-aware then that's enough, you don't have to do anything else to be able to display Japanese properly for people who care about displaying Japanese properly (it's always possible for an app maker to go above and beyond, but I'm sure you'll agree the big difference is between an app that has a way to display proper Japanese at all and one that does not).
Are there Japanese characters missing in UTF-8? They should be added ASAP.
I know there's a weird Chinese/Japanese encoding problem where characters that kind-of look alike have the same character id, and the font file is responsible for disambiguation (terrible for multi-language content and we should really add more characters to create versions for each, but still the best we have).
IMHO the Unicode Consortium should standardize on using a variant selector for switching between Chinese and Japanese variants of unified Han characters. Best of both worlds: language-independent specificity while keeping the ability to have both in the same document.
Many Japanese websites have migrated from Shift-JIS to UTF-8, but this still ignores the fact that e.g. television captioning uses special characters[1] that are not found in UTF-8 or Shift-JIS. Windows itself has a habit of using its own Windows-932 encoding, which frequently causes problems in the Unix software I use. (e.g. Emacs fails at auto-detecting this format, and instructing Emacs to use Shift-JIS will result in decoding issues)
Windows-932 is the text encoding I dread the most. I wish literally everything could use Unicode, but we're not quite there yet. There's a reason encoding-japanese[1] on NPM has nearly 1 million weekly downloads. Unfortunately, I have to use it in one of my React Native applications, since one of the servers I speak to returns text encoded as Shift-JIS and Hermes does not implement JavaScript's TextDecoder API. Shift-JIS is marginally better than the Windows code page, but I'd really prefer UTF-8.
More or less. The proportion of Japanese websites that use Shift-JIS is actually increasing, for example. (It's true that the absolute number of Japanese websites using UTF-8 is increasing, but that's misleading - it's only due to the overall growth of the web).
The parent said "displayed correctly", not "encoded".
For example, if I want to talk about the fairly rare Japanese name '刃一' (Jinichi), there's a chance your computer displays it correctly, but there's also a chance your computer displays the Chinese variant of the first character, making it look wrong. It's basically down to your computer's font choice.
The "correct" way to fix that would be for me to be able to tag it with 'lang=ja', but Hacker News doesn't let me include html tags or some other 'language selector' in my comment, so I'm unable to indicate whether that's supposed to be the chinese or japanese variant of the character.
Most unicode text files don't have extra metadata indicating if a certain bit of text is japanese or chinese, so displaying it correctly by adding the correct 'lang' tag is impossible, especially since it's perfectly possible for one utf-8 text to mix both chinese and japanese.
I didn't propose any solution or claim Shift-JIS fixes this. It doesn't really since a single Shift-JIS document can only encode the japanese variant, not both.
However, a unicode codepoint which acted as a "language hint" switch would solve this, and wouldn't require doubling the number of han codepoints.
There already are unicode variant selectors, but iiuc they only apply to a single character at a time, and no one actually uses them, so they're not very useful.
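For the curious, this is what using one of those variation selectors looks like in practice. A hedged illustration: 葛 (U+845B) is a commonly cited character with registered variants, but whether the forms actually render differently depends on the font and on which Ideographic Variation Database registrations it supports.

    base = "\u845b"        # 葛
    vs17 = "\U000e0100"    # VARIATION SELECTOR-17, first of the IVD range
    vs18 = "\U000e0101"    # VARIATION SELECTOR-18

    for s in (base, base + vs17, base + vs18):
        print(s, [f"U+{ord(c):04X}" for c in s])
    # Codepoint-wise these are three different strings, which is part of why
    # naive search implementations struggle with them.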
The Shift-JIS codepoints for the characters of that name are understood to refer to Japanese characters, so fonts render them correctly. Encoding-aware programs have different representations (such as GB 18030 codepoints) for the similar-but-different Chinese characters that unicode-only programs tend to display these characters as, and so will render them differently.
If I understand you correctly, it means that the encoding itself serves as the metadata that indicates Chinese/Japanese. In which case, why is it unreasonable to ask for the same for UTF-8, except using some more clearly specified way to indicate this (like lang="ja" etc), rather than encoding it all into separate characters?
> If I understand you correctly, it means that the encoding itself serves as the metadata that indicates Chinese/Japanese.
Only if you have some unicodebrained mentality where you consider a Chinese character that looks sort of similar to a Japanese character to be "the same". If you think of them as two different characters then they're just different characters, which may or may not be present in particular encodings (which is completely normal if you make a program that handles multiple encodings: not every character exists in every encoding)
> In which case, why is it unreasonable to ask for the same for UTF-8, except using some more clearly specified way to indicate this (like lang="ja" etc), rather than encoding it all into separate characters?
Firstly, you have to "draw the rest of the fucking owl" and actually implement that language selection mechanism. Secondly, if you implement some clever extension mechanism on top of UTF-8 that's only needed for Japanese and can only really be tested by people who read Chinese or Japanese, realistically even if you implement it perfectly, most app makers won't use it or will use it wrong. Whereas if you implement encoding-awareness in the standard way that we've been doing for decades and test with even one non-default encoding, your application will most likely work fine for Japanese, Chinese and every other language even if you never test it with Japanese.
Any of the well-known han unification examples. People claim this can be solved with some kind of out-of-band font selection vaporware, but the kind of person who thinks it's fine to demand UTF-8 everywhere never actually stoops to implement this imaginary font selection functionality.
Well, I must nitpick: Shift JIS is actually just one of those out-of-band vaporware methods. Sure, it's better supported because of legacy, but new code that doesn't care about handling some lang metadata is not going to care about Shift JIS either.
Of course, there is no correct solution (people in other comments are a bit too quick to believe one exists). A dynamic HTML page can load pieces of text written by people from all across the world, so it can't rely on any “main” language. Those people can't be automatically classified (based on some browser settings or location). Then there are always people who know both Chinese and Japanese (while using some other system locale, to make things more complex). It is wrong to assume that they should not be able to use Chinese forms and Japanese forms at once, even inside a single paragraph or phrase.
I wonder why Unicode has not simply introduced some “combining language form character” to make the user's choice stick. After all, there's a whole subsystem for emoji modification, and those things were once “weird Japanese texting customs”. As for the complexity of handling Unicode text, it asymptotically approaches its maximum anyway.
Wait a second, there are “variation selectors” and some “Ideographic Variation Database”. Is that the solution? Can IMEs and converters simply stamp each variable character with an invisible mark based on the current input language? I suppose there's some catch…
> Shift JIS is actually just one of those out-of-band vaporware methods.
It's the opposite of vaporware; there's a whole bunch of well-known software that handles encodings correctly.
> new code that doesn't care about handling some lang metadata is not going to care about Shift JIS either.
If you support encodings the way almost every programming language tells you to, you'll handle Shift-JIS just fine. If you use the "legacy" encodings approach and test even one non-English language, you'll handle Japanese fine.
> Can IMEs and converters simply stamp each variable character with invisible mark based on current input language? I suppose there's some catch…
They're officially deprecated, IMEs and converters don't use them, naïve search implementations (which is to say most search implementations, because people do not test for an obscure edge case) break, ....
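To illustrate the search problem (a toy example of my own, not from any real app): the selector sits between the base characters, so a plain substring match on what the user typed no longer finds the text.

    # The document stores 葛 followed by an ideographic variation selector
    # (U+E0100); the user searches for the plain text without the selector.
    document = "東京都葛\U000e0100飾区"
    query = "葛飾区"

    print(query in document)   # False: the naive match fails
    # A selector-aware search would have to strip U+FE00..U+FE0F and
    # U+E0100..U+E01EF from both strings before comparing.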
But there is no “naive” implementation of Unicode. It won't handle emoji, normalization, and a thousand other things. People use The Library anyway. Adding another mutagen to it should be no different from the rest.
As for official stance, maybe it's time for some group to agree on certain non-conflicting sequences, and implement them in some library/stack. A particularly evil solution would choose arbitrary existing combining characters that can't be declared “illegal” retroactively.
By the way, is it possible to make proper quotes from other languages as described above when using Shift JIS?
> But there is no “naive” implementation of Unicode.
I assure you there are thousands if not millions of apps that implement search the naive way, by comparing bytes.
> It won't handle emoji, normalization, and thousands other things.
Yeah. But it works well enough for Americans, so the maintainers don't care enough to fix it.
> As for official stance, maybe it's time for some group to agree on certain non-conflicting sequences, and implement them in some library/stack.
And then what? What do you do with documents in Japanese that have already been converted without using your new codepoints? What do you do with programs that ignore your new standard? What do you do about search being broken in most apps with no prospect of fixing it? There are already some PUA codepoints for a proper implementation of Japanese, but most app authors don't even understand the problem, never mind being willing to support something that's "non-standard". Asking them to support traditional non-unicode encodings, which is something that's at least relatively well-known and standardised, and something they can test without knowing Japanese, is much easier.
> By the way, is it possible to make proper quotes from other languages as described above when using Shift JIS?
No, not for arbitrary languages. If you want to mix arbitrary languages, you need some structure that's a rope of strings with an encoding per segment. But that's exactly the same thing that the unicode advocates claim is easy ("just have a sequence of spans with different lang tags"), and at least if you use traditional encodings then you don't have to do any such fancy business for the much more common case of a file that's entirely in Japanese (perhaps with some ASCII keywords).
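For what it's worth, the rope-of-segments structure I mean could look something like this sketch (all names and types are made up for illustration):

    from dataclasses import dataclass

    @dataclass
    class Segment:
        encoding: str   # e.g. "shift_jis", "gb18030", "utf-8"
        data: bytes     # raw bytes in that encoding

    @dataclass
    class MixedText:
        segments: list

        def to_display(self) -> str:
            # Each span decodes with its own encoding; the encoding doubles
            # as the Japanese-vs-Chinese hint being argued about above.
            return "".join(s.data.decode(s.encoding) for s in self.segments)

    doc = MixedText([
        Segment("shift_jis", "日本語の部分。".encode("shift_jis")),
        Segment("gb18030", "中文部分。".encode("gb18030")),
    ])
    print(doc.to_display())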
> implement this imaginary font selection functionality
We are already doing GSUB/GDEF tables and a lot of horrible stuff to display pretty glyphs on the screen. Hell, there are literal VM instructions in TTF files to help with pretty rasterization on low-res screens.
Making a font rendering library is hard. Nintendo hard.
That's just the nature of the beast. Fonts are messy, and if we want one standard to deal with them once and for all, some compromises must be made. CJK is not some obscure alphabet. In this case, perfect is the enemy of good.
I am just glad there really is only one (non-niche) standard, obligatory https://xkcd.com/927/
Right. One would think that misrendering a major world language used by over 100 million people would be an issue that warrants some attention. But too many of the HN crowd don't care.
> In this case, perfect is enemy of good.
It's not though. Unicode-only is not just imperfect, it's an outright regression for Japanese. Meanwhile traditional encoding-aware programs render Japanese just fine.
> It's not though. Unicode-only is not just imperfect, it's an outright regression for Japanese.
Unicode has variant selectors which can deal with regional variants (e.g. https://imgur.com/a/syMcWNO), so there is a way to address the issue.
Granted, it's not very widely used, but Unicode provided a solution. The onus is now on application developers and font providers.
Choosing to limit characters to 2 bytes was a reasonable technical choice back in 1992; non-unified CJK wouldn't have fit (that was the era of 2-4 MB of RAM and 100-200 MB hard drives). A solution to the problem was provided later.
Using Shift-JIS is like running a custom protocol directly on top of IP instead of using UDP.
Sure, you can do it, but it's a dead-end choice. Time is better invested in improving the standard, which is Unicode (e.g. conversion software).
> Unicode has variant selectors which can deal with region variants (e.g. https://imgur.com/a/syMcWNO) to deal with the issue.
They're officially deprecated and cause issues like breaking search, etc. So they're still below feature parity with using a traditional encoding. Traditional encoding support is also easier to test, since documents with traditional encodings are more widespread and exist for many world languages, not just Japanese.
Unless you're using an OS older than Windows 2000, or a Linux distro from the 2000s where some form of Unicode was not the default encoding, or maybe an ancient Win32 program compiled without "UNICODE" defined, it shouldn't be a problem. I specifically work with a lot of Japanese software and have not seen this problem in many years.
And even back in the mid-2000s, the only real problems I saw were things like malformed HTML pages that assumed a specific encoding they wouldn't tell you about, or MP3 files with CP932 shoved into ID3 tags against the (v1) spec.
I also disagree with the author that Shift-JIS can be heuristically detected "well enough": its trail bytes can fall in the 7-bit ASCII range as well as the 8-bit range, so the same byte value means different things depending on which character is actually intended. Even string searching requires a complex custom-made version just for Shift-JIS handling (see the sketch below).
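Here's a small sketch of that search problem (my own example): in Shift-JIS the second byte of a double-byte character can fall in the ASCII range, so byte-level search misfires.

    text = "ソース"              # "source" in katakana
    sjis = text.encode("shift_jis")
    print(sjis.hex(" "))         # 83 5c 81 5b 83 58

    # Byte-level search "finds" an ASCII backslash (0x5C) even though the
    # text contains none: it's the trail byte of ソ (0x83 0x5C).
    print(b"\\" in sjis)         # True, a false positive
    print("\\" in text)          # False, the codepoint-aware answer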
The article's pretty weird for presenting little-endian UTF-16 as normal and barely even mentioning that big-endian is an option (in fact, seems to refer to it as "backwards"), even though big-endian is a much more human readable format.
Java and Javascript both use UTF-16 for internal string representation, even though JSON specifies UTF-8. Windows APIs too. I'm still not sure why, but it means that one char uses at least 2 bytes even if it's in the ASCII range.
Early adopters of Unicode used the first available encoding, UCS-2. UTF-16 is an extension of that to handle the increased range of code points that came later.
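A quick way to see that extension in action (my own example): BMP characters still take a single 16-bit unit, exactly as in UCS-2, while anything above U+FFFF becomes a surrogate pair.

    bmp = "A"        # U+0041: one 16-bit unit, same as UCS-2
    astral = "😀"    # U+1F600: needs a surrogate pair in UTF-16
    print(bmp.encode("utf-16-be").hex())      # 0041
    print(astral.encode("utf-16-be").hex())   # d83dde00 (high + low surrogate)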
Why would anyone use anything other than UTF-8 in this day and age?
Windows took a gamble years ago when the winner was not obvious; we're stuck with UCS-2, but you can work around that with a good string library like Qt's QString.
However, you can check for invalid UTF-8 sequences, throw an error like "invalid encoding at byte x, please use valid UTF-8" if one is encountered, and from that point on assume UTF-8.
But you can assume non-UTF-8 upon seeing an invalid UTF-8 byte sequence. From there, it can be application-specific depending on what encoding you expect to see.
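In Python terms, the reject-and-report approach from the two comments above might look like this sketch (the file name is hypothetical):

    data = open("input.txt", "rb").read()
    try:
        text = data.decode("utf-8")   # strict by default
    except UnicodeDecodeError as err:
        raise SystemExit(
            f"invalid encoding at byte {err.start}, please use valid UTF-8")
    # From here on, `text` is known-good Unicode decoded from UTF-8.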
Dear lazyweb: I think I read something about Postel's Law being essential to the internet's success -- maybe this was also IPv6 related? Does anyone else remember this?
I maintain a system I created in 2004 (crazy, right?). Not sure how we lived, but at the time emojis were not as much of a thing. This has come to bite me several times.
I'm actually having this issue where users import CSV files that don't seem to be valid. DuckDB throws errors like: "Invalid Closing Quote: found non trimable byte after quote at line 34", "Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in value construction", "Value with unterminated quote found".
One example: pgAdmin can export a database table into a CSV... but the CSV isn't valid for DuckDB to consume. Because, for some odd reason, pgAdmin uses a single quote to escape a double quote.
This blog is pretty timely. Thank you for writing it!
Pretty hard. There's no true way of knowing when you're right. You can make educated guesses based on statistical likelihood of certain patterns, but nothing stops people from constructing text that happens to look more "normal" when interpreted as another character set.
If I see the bytes 0xE5 0xD1 0xCD 0xC8 0xC7, then it is a decent bet that it's intended to be مرحبا in DOS 708. But there's no definitive reason why it couldn't be σ╤═╪╫ (DOS 437) or åÑÍÈÇ (Windows-1252) or Еямих (KOI8-RU). Especially if you don't know for sure that the data is intended to be natural language.
I can at least rule some things out. For instance, it isn't valid UTF-8: 0xE5 would start a three-byte sequence, so the next two bytes would have to be continuation bytes in the range 0x80-0xBF, and 0xD1 isn't one. (It happens to be structurally valid Shift-JIS - a double-byte character followed by three half-width katakana - but that reading is gibberish no real Japanese text would contain.) So you can narrow down the possibilities... but you're still making an educated guess. And you have to figure out a way to program that fuzzy, judgment-y analysis into software.
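You can watch that ambiguity directly with codecs Python happens to ship (ISO 8859-6 stands in for the DOS Arabic code page, so the exact glyphs differ slightly from the ones above):

    blob = bytes([0xE5, 0xD1, 0xCD, 0xC8, 0xC7])
    for codec in ("iso8859-6", "cp437", "cp1252", "koi8-r"):
        print(codec, blob.decode(codec))
    # Every one of these decodes without error, so validity alone can't
    # tell you which interpretation the author intended.
    try:
        blob.decode("utf-8")          # UTF-8 at least rejects it outright
    except UnicodeDecodeError:
        print("not valid UTF-8")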
I've written code to do this. If you're lucky there will be a BOM (byte order mark) or MIME type to indicate the encoding form. In this case you know the encoding. If you don't have this information, then you must guess the encoding. The issue is guessing may not produce accurate results, especially if you don't have enough bytes to go by.
The program I wrote to guess the encoding would scan the bytes in multiple passes. Each pass would check if the bytes encoded valid characters in some specific encoding form. After a pass completed I would assign it a score based on how many successfully encoded characters were (or were not) found. After all passes completed I'd pick the highest score and assume that was the encoding. This approach ended up being reasonably reliable assuming there were enough bytes to go by.
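A deliberately crude approximation of that scoring idea (my own sketch, not the commenter's actual code):

    CANDIDATES = ["utf-8", "shift_jis", "cp1252", "koi8-r"]

    def guess_encoding(data: bytes) -> str:
        best, best_score = "utf-8", float("-inf")
        for enc in CANDIDATES:
            text = data.decode(enc, errors="replace")
            # One pass: reward printable characters, penalize bytes that
            # failed to decode (U+FFFD) and stray control characters.
            score = sum(
                1 if ch.isprintable()
                else -5 if ch == "\ufffd"
                else 0 if ch in "\r\n\t"
                else -1
                for ch in text
            )
            if score > best_score:
                best, best_score = enc, score
        return best

    # A real implementation would also weight letter frequencies per language;
    # with a scorer this crude, single-byte codecs (which accept any byte)
    # often tie or win, which is exactly why detection remains a guess.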
But I will, because in this day and age, I should be perfectly able to do so.
Non-use of UTF-8 should simply be considered a bug, and not treating text as UTF-8 should frankly be a you problem. At least for anything reasonably modern, by which I mean anything made in the last 15 years or so.
Do we really need 128 permutations just to express an alphabet of 26 letters?
I think we should use a 4 bit encoding.
0 - NUL
1-7 - aeiouwy
8 - space
9-12 - rst
13-15 - modifiers
When modifier bits are set, the values of the next half-byte change to represent the rest of the alphabet, numbers, symbols, etc. depending on the bits set.
As someone who has repeatedly had to deal with Unicode nonsense; I wholeheartedly agree. Also, you don’t need accents. You just need to know how to read and have context. See: live and live, read and read, etc.
I suspect the author doesn't know about the "first bit is 1" thing (every byte of a multi-byte UTF-8 sequence has its high bit set, so plain ASCII bytes never appear inside one).
utf8 is magic.
You can assume US-ASCII for lots of very useful text protocols, like HTTP and STOMP, and not care what the variable string bytes mean.
Soooo many software architects don't grok the magic of it.
You can define an 8-bit parser that checks for "a:"(some string)\n and it will work with a shitload of human languages (see the sketch below).
The company I work for does not realise that most of the 50-year-old legacy C it has is fine with UTF-8 for all the arbitrary fixed-length or \0-terminated strings it stores.
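A tiny sketch of the kind of parser I mean (my own illustration, with a made-up STOMP-ish frame): it only ever looks at ':' and '\n' as bytes, so the value bytes can be any UTF-8 text.

    def parse_headers(raw: bytes) -> dict:
        headers = {}
        for line in raw.split(b"\n"):
            if not line:
                break                      # blank line ends the header block
            key, _, value = line.partition(b":")
            headers[key.strip()] = value.strip()
        return headers

    frame = "destination:/queue/注文\ncontent-type:text/plain\n\n".encode("utf-8")
    print(parse_headers(frame))   # the Japanese queue name passes through as-is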
I wish I could live in the world where I could bluntly say "I will assume UTF-8 and ignore the rest of the world". Many Japanese documents and sites still use Shift JIS. Windows has this strange Windows-932 format that you will frequently encounter in CUE files outputted by some CD ripping software. ARIB STD-B24, the captioning standard used in Japanese television, has its own text encoding with characters not found in either JIS X 0201 or JIS X 0208. These special characters are mostly icons used in traffic and weather reports, but transcoding to UTF-8 still causes trouble with these icons.
It's more like "I will assume UTF-8 and ignore edge case encoding problems which still arise in Japan, for some strange reason".
We are not running short on Unicode codepoints. I'm sure they can spare a few more to cover the Japanese characters and icons which invariably get mentioned any time this subject comes up on HN. I don't know why it hasn't happened and I won't be making it my problem to solve. Best I can do is update to version 16 when it's released.
I mention Japanese because I deal with Japanese text daily. I could mention some Chinese documents and sites using GBK to save space (since such encodings use exactly 2 bytes per character whereas the average size in UTF-8 is strictly larger than 2 bytes). But I am not very familiar with it. Overall, I would not say these are "strange reasons".
Other encodings exist, yes. But they can all be mapped to UTF-8 without loss of information[0]. If someone wants to save space, they should use compression, which will reduce any information, regardless of encoding, to approximately the same size. So it's perfectly reasonable to write software on the assumption that data encoded in some other fashion must first be reëncoded as UTF-8.
[0]: Except Japanese, people hasten to inform us every time this comes up. Why? Why haven't your odd characters and icons been added to Unicode, when we have cuneiform? That's the strange part. I don't understand why it's the case.
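As a rough check of the compression point (my own toy measurement; exact numbers depend on the text and compressor, and the repetition here makes the sample compress especially well):

    import zlib

    text = "这是一段用来比较编码大小的中文文本。" * 200
    utf8, gbk = text.encode("utf-8"), text.encode("gbk")

    print(len(utf8), len(gbk))   # raw: roughly 3 bytes/char vs 2 bytes/char
    print(len(zlib.compress(utf8, 9)), len(zlib.compress(gbk, 9)))
    # compressed: the gap shrinks dramatically, though it may not vanish entirely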
Unicode did a kind of dumb thing with CJK: unifying Chinese and Japanese kanji makes displaying CJK text a much harder problem than it should be, since correct display now also relies on a language-specific font[0]. I guess this could be band-aided by some sort of language marker in the UTF-8 bytestring, which a text shaping engine would then have to understand and switch the font accordingly.
Kind of a band-aid (it's necessary to stuff a variant selector after a CJK codepoint), but should work.
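A sketch of that band-aid, as far as I understand it: append an Ideographic Variation Selector to pin the glyph. 辻 (U+8FBB) has variation sequences with VS17/VS18 registered in the Adobe-Japan1 IVD collection (the one-dot vs. two-dot forms); whether anything actually changes on screen depends entirely on the font and the text renderer.

    base = "\u8fbb"                    # 辻
    variant_a = base + "\U000e0100"    # U+8FBB + VS17
    variant_b = base + "\U000e0101"    # U+8FBB + VS18
    print(variant_a, variant_b)        # identical to the eye unless the font
                                       # and shaper honor the selectors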
These decisions were made back in 1992, and fitting every codepoint in 16 bits was one of the desired goals; non-unified CJK wouldn't have fit. In hindsight it looks like a rather unfortunate decision, but having more codepoints than would fit in 16 bits could have seriously hampered adoption, and a different standard might have won (compute resources were far more limiting back then).
In either case, it's like 4-byte addressing in IPv4: in hindsight 6+ bytes would have been better, but what's done is done.
Edit: Even in 2000s, when C# was released, string was just a sequence of 16-bit code units (not codepoints), so they could deal with BMP without problems and astral planes were ... mostly DIY. They added Rune support (32-bit codepoint) only in .NET Core 3.0 (2019).
They aren't strange, but they are sort of self-inflicted, so it's not unreasonable for others to say, "we're not going to spend time and effort to deal with this mess".
I'm Russian. 20 years ago that meant having to deal with two other common encodings aside from UTF-8 (CP1251 and KOI8-R). 25 years ago, it was three encodings (CP866 was the third one). Tricks like what the article describes were very common. Things broke all the time anyway because heuristics aren't reliable.
These days, everything is in UTF-8, and we're vastly better off for it.
Unless the Unicode Consortium decides to undo the Han Unification stuff, I don't think it's going to get better for Japanese users, and programmers who build for a Japanese audience will have to continue to suffer with Shift-JIS.
There will be no undoing of anything, fortunately. Unicode is committed to complete backward compatibility, to the point where typos in a character name are supplemented with an alias, rather than corrected. Han Unification was an unforced error based on the proposition, which was never workable, that sixteen bits could work for everyone. This is entirely Microsoft's fault, by the way. But it shouldn't be, and won't be, fixed by breaking compatibility. That way lies madness.
There are two additional planes set aside for further Hanzi, the Supplementary and Tertiary Ideographic Planes; the latter is still mostly empty. Eventually the last unique ideograph used only to spell ten known surnames from the 16th century will also be added as a codepoint.
I view the continued use of Shift-JIS in Japan as part of a cultural trend, related to the continued and widespread use of fax machines, or the survival of floppy disks for many years after they were effectively dead everywhere else. That isn't at all intended as an insult, it's that matters Japanese stay within Japan to a high degree. Japanese technology has less outside pressure for cross-compatibility.
Shift-JIS covers all the corner cases of the language, and Unicode has been slow to do likewise, and it isn't like Japanese computers don't understand UTF-8, so people have been slow to switch. It's the premise of "unaware of how it works in the rest of the world" that I object to. It's really just Japan. Everywhere else, including the Chinese speaking parts of the world, there's Unicode data and legacy-encoded data, and the solution to the later is to encode it in the former.
> ARIB STD-B24, the captioning standard used in Japanese television, has its own text encoding with characters not found in either JIS X 0201 or JIS X 0208.
Amazingly enough, ARIB STD-B24 is one of the major sources of Unicode emoji. So transcoding would actually work for them! (I am aware of some exceptions, but semantically there is no loss.) Unicode and UTF-8 are truly eating every other legacy encoding, so much so that it is becoming more reasonable to have a separate transcoding step.
But remember, adding emoji to a character encoding standard is Morally Bad, somehow, and Proof Of Intellectual Decay In The Modern World. Also, Unicode invented emoji to Sap And Impurify Our Bodily Fluids and there is no reason for any of them to exist.
Most people on this site probably live in the world where everything is done in English. That's the norm for the vast majority of businesses and people in the US.
Even for those people, there is still a ton of old text files in Windows-1252 etc floating around.
You can choose to never work on projects where you have to deal with files like that.
But there may come a day where you have to choose between not paying your rent or writing a tool that converts old textfiles to UTF-8. At that point, it's nice to have references on the internet on how other people have actually dealt with it and what works. "Abort with an error" is not very useful advice then.
Why would you write a tool that does that instead of just digging up one that's already written? This sounds like the folly of writing ones own encryption library.
As far as implementing new tech goes, this sounds like about the easiest research project you could ask for. Grab a few and test em out. Not sure why you're making this out to be some kind of intractable problem.
I do. I cut my IT teeth on an old ass system from the 80s in the 00's. I remember having problems feeding the reports into modern systems. Goofy problems with eol and eof and some other hiccups. It wasn't that bad.
Honestly if you can't review/read/figure that out without writing a library of your own, you probably shouldn't be writing a library of your own in the first place.
You can't pick and use a library like this without understanding the underlying concepts. That goes for both encryption and the charset conversion issue. It's not always just plug and play.
There are examples of where people used encryption libraries in the wrong way and undermined the strength of the encryption (for example, CVE-2024-31497 in PuTTY).
A very big part of software development is dealing with leaky abstractions. We don't work with perfect black boxes. We need to understand enough of how things works in the lower layers to avoid problems. Note here that I wrote "enough", not "everything", or "write everything yourself".
I would not want a person writing software to handle charset conversion if he refuses to learn how the various encodings work, which charset will decode as another charset or not, etc.
Your example seems to be due to, paraphrasing, lack of a library rather than inability to choose one.
"older approach, PuTTY developers said, was devised at a time when Microsoft Windows lacked native support for a cryptographic random number generator."
So "enough of how things work" could just be "pick a modern encryption library" that doesn't come from the dark ages when there were no random numbers.
Same with encodings: picking a library requires a much lower level of understanding, since you can rely on the expertise of others.
Fascinating topic. There are two ways the user/client/browser receives reports about the character encoding of content. And there are hefty caveats about how reliable those reports are.
(1) First, the Web server usually reports a character encoding, a.k.a. charset, in the HTTP headers that come with the content. Of course, the HTTP headers are not part of the HTML document but are rather part of the overhead of what the Web server sends to the user/client/browser. (The HTTP headers and the `head` element of an HTML document are entirely different.) One of these HTTP headers is called Content-Type, and conventionally this header often reports a character encoding, e.g., "Content-Type: text/html; charset=UTF-8". So this is one place a character encoding is reported.
If the actual content is not an (X)HTML file, the HTTP header might be the only report the user/client/browser receives about the character encoding. Consider accessing a plain text file via HTTP. The text file isn't likely to itself contain information about what character encoding it uses. The HTTP header of "Content-Type: text/plain; charset=UTF-8" might be the only character encoding information that is reported.
(2) Now, if the content is an (X)HTML page, a charset encoding is often also reported in the content itself, generally in the HTML document's head section in a meta tag such as '<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>' or '<meta charset="utf-8">'. Now just because an HTML document self-reports that it uses a UTF-8 (or whatever) character encoding, that's hardly a guarantee that the document does in fact use said character encoding.
Consider the case of a program that generates web pages using a boilerplate template still using an ancient default of ISO-8859-1 in the meta charset tag of its head element, even though the body content that goes into the template is being pulled from a database that spits out a default of utf-8. Boom. Mismatch. Janky code is spitting out mismatched and inaccurate character encoding information every day.
Or consider web servers. Consider a web server whose config file contains the typo "uft-8" because somebody fat-fingered it while updating the config (I've seen this in random web pages). Or consider a web server that uses a global default of "utf-8" in its outgoing HTTP headers even when the content being served is a hodge-podge of UTF-8, WINDOWS-1251, WINDOWS-1252, and ISO-8859-1. This too happens all the time.
I think the most important takeaway is that with both HTTP headers and meta tags, there's no intrinsic link between the character encoding being reported and the actual character encoding of the content. What a Web server tells me and what's in the meta tag in the markup just count as two reports. They might be accurate, they might not be. If it really matters to me what the character encoding is, there's nothing for it but to determine the character encoding myself.
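Treating both sources as mere reports and then checking them against reality can be as simple as this sketch (hypothetical URL; uses the common requests and BeautifulSoup libraries):

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/page.html")

    # Report 1: charset derived from the Content-Type HTTP header
    # (requests falls back to a default when the header is silent).
    header_charset = resp.encoding

    # Report 2: charset claimed by the document's own <meta charset="..."> tag
    # (the older http-equiv form would need a second lookup, omitted here).
    soup = BeautifulSoup(resp.content, "html.parser")
    meta = soup.find("meta", charset=True)
    meta_charset = meta["charset"] if meta else None

    # Reality check: does the raw content actually decode as claimed?
    # (Single-byte codecs will "decode" anything, so this mostly catches
    # UTF-8 lies and typo'd charset names.)
    for charset in {c for c in (header_charset, meta_charset) if c}:
        try:
            resp.content.decode(charset)
            print(charset, "decodes cleanly")
        except (UnicodeDecodeError, LookupError):
            print(charset, "does NOT decode")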
I have a Hacker News reader, https://www.thnr.net, and my program downloads the URL for every HN story with an outgoing link. I have seen binary files sent with a "UTF-8" Content-Type header. I have seen UTF-8 files sent with an "inode/x-empty" Content-Type header. My logs have literally hundreds of goofy inaccurate reports of content types and character encodings. Because I'm fastidious and I want to know what a file actually is, I have a function `get_textual_mimetype` that analyzes the content of what the URL's web server sends me. My program downloads the content and uses tools such as `iconv` and `isutf8` to get some information about what encoding it might be. It uses `xmlwf` to check if it's well-formed XML. It uses `jq` to check whether it's valid JSON. It uses `libmagic`. There's a lot of fun stuff the program does to pin down with a high degree of certainty what the content is. I want my program to know whether the content is an application/pdf, an image/webp, a text/html, an application/xhtml+xml, a text/x-csrc, or whatever. Only a rigorous analysis will tell you the truth. (If anyone is curious, the source for `get_textual_mimetype` is in the repo for my HN reader project: https://github.com/timoteostewart/timbos-hn-reader/blob/main... )