Airlines make it worse, because they strip both characters during sanity checking, so my name comes out as 'Lon', which has caused me problems a couple of times as the name on my passport did not match the name on the ticket.
What these things all reinforce is that a lot of programmers take text encoding as a given, and don’t realize all the potential places for errors to sneak in.
Could be a fun way to hunt for buffer overflows on internal shipping services. Just fill in the sender name field with "óóóóóóóóóóóóóóóóóóóóóóó" and let it expand. If the parcel arrives, not vulnerable. If the parcel doesn't arrive, you've found a vulnerability... somewhere...
I wouldn't say an accent "causes" UTF-8 encoding issues. If acute accents are a problem, then UTF-8 handling has completely failed.
It is amazing to me where I see failed encoding like that. For instance, many SEC filings and job ads for tech companies. I mean, I feel like I'm expected to spell things correctly on my resume and emails at work...
latin1 is the default encoding for text, including HTML, in protocols such as HTTP if you don't specify one (modulo some stupidity from the WHATWG, where it might be Windows-1252 instead), and Windows-1252 is the default encoding on Windows in the USA (at least prior to the Unicode APIs being added; the old APIs probably still exist, though…). So these codecs pop up a lot in places where people who don't know what they're doing end up touching text.
The WHATWG HTML spec requires UTF-8 for conforming documents and scripts [WHATWG 4.2.5.4]. In both HTML specs, charset declarations, if provided, must be UTF-8 [4.2.5].
If the transport, content-type, charset declaration, and sniffing all fail to determine an encoding, both specs fall back to defaults based on the configured locale; for English that's windows-1252 [WHATWG: 12.2.3.2, W3C: 8.2.2.2]. latin1/ISO-8859-1 is prohibited [WHATWG: 12.2.3.3, W3C: 8.2.2.3].
I ran across some code once for descrambling data that had been incorrectly processed like that, which I found common in legal documents. It's an interesting problem, because strictly speaking, it's lossy, but you can use probabilities to figure out something plausible. You can decode/encode one thing as another, or you can decode/encode multiple times...
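The single-pass version of that descrambling is short in Python. This is only a sketch assuming exactly one Latin-1 mis-decode; the lossy, multi-pass, probabilistic case the comment describes is much harder:

```python
# Round-trip repair of classic UTF-8-read-as-Latin-1 mojibake.
# Sketch only: real documents may be mis-decoded multiple times or
# through Windows-1252, which this simple version doesn't handle.
def fix_mojibake(text: str) -> str:
    try:
        # Re-encode the garbled string back to the bytes it came from,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not repairable this way; leave the text alone.
        return text

print(fix_mojibake("LÃ©on"))  # Léon
```

For the harder cases, the Python library ftfy does this kind of repair with heuristics for repeated and mixed mis-decodings.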
Any chance you have a link? I've had to implement solutions to this myself and it's very tedious. If someone has built a more complete solution, I would love to just use that instead.
That might be what I'm remembering; then again, I don't really do Python, so maybe it was something else. I doubt it was anything better than the link above, regardless.
You could try inputting your name as [Latin Small Letter E][Combining Acute Accent]:
e◌́
=>
é
Which should keep the `e` intact, while the combining acute accent (0xCC 0x81) may "only" get converted to an `Ì`, which may then be stripped. 0x81 is undefined in Windows-1252, so I have no idea what would happen to it, but it would probably be stripped as well, leaving just 'Leon'.
Unless someone decides to NFC-normalize the text along the way. And it's generally agreed that text should be normalized with NFC, although there is often a fierce debate about who should do it ("not me").
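Both behaviors are easy to check in Python: the raw UTF-8 bytes of the combining sequence (the 0xCC 0x81 mentioned above), and the single precomposed code point that NFC normalization turns it into:

```python
import unicodedata

# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT
s = "e\u0301"

# The combining accent encodes to the bytes 0xCC 0x81 in UTF-8.
print(s.encode("utf-8"))                 # b'e\xcc\x81'

# NFC normalization collapses the pair into precomposed U+00E9.
print(unicodedata.normalize("NFC", s))   # é
```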
Reminds me of the times when Amazon failed to reproduce the ü in my last name on their shipping labels. They consistently printed the UTF-8-encoded character interpreted as an 8-bit ASCII sequence. That bug was present for a couple of years.
The letter e with an acute accent causes all sorts of UTF-8 encoding issues with many services, not just airlines. If you interpret the UTF-8 bytes for é (0xC3 0xA9) as Latin-1, they become Ã (0xC3) + © (0xA9), so my name often comes out as 'Léon'.
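That exact transformation can be reproduced in a couple of lines of Python, assuming a straight Latin-1 interpretation of the UTF-8 bytes:

```python
# Decode UTF-8 bytes as Latin-1 to reproduce the mojibake above.
name = "Léon"
garbled = name.encode("utf-8").decode("latin-1")
print(garbled)  # LÃ©on
```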