Airlines make it worse, because they strip both characters during sanity checking, so my name comes out as 'Lon', which has caused me problems a couple of times as the name on my passport did not match the name on the ticket.
What these things all reinforce is that a lot of programmers take text encoding as a given, and don’t realize all the potential places for errors to sneak in.
Could be a fun way to hunt for buffer overflows on internal shipping services. Just fill in the sender name field with "óóóóóóóóóóóóóóóóóóóóóóó" and let it expand. If the parcel arrives, not vulnerable. If the parcel doesn't arrive, you've found a vulnerability... somewhere...
I wouldn't say an accent "causes" UTF-8 encoding issues. If acute accents are a problem, then UTF-8 handling has completely failed.
It is amazing to me where I see failed encoding like that. For instance, many SEC filings and job ads for tech companies. I mean, I feel like I'm expected to spell things correctly on my resume and emails at work...
latin1 is the default encoding for text, including HTML, in protocols such as HTTP if you don't specify one (modulo some stupidity from the WHATWG, where it might be Windows-1252 instead), and Windows-1252 is the default encoding on Windows in the USA (at least prior to the Unicode APIs being added; the old APIs probably still exist, though…). So these codecs pop up a lot in places where people who don't know what they're doing end up touching text.
The WHATWG HTML spec requires UTF-8 for conforming documents and scripts [WHATWG 4.2.5.4]. In both HTML specs, charset declarations, if provided, must be UTF-8 [4.2.5].
If the transport, content-type, charset declaration, and sniffing all fail to determine an encoding, both specs fall back to defaults based on the configured locale; for English that's windows-1252 [WHATWG: 12.2.3.2, W3C: 8.2.2.2]. latin1/ISO-8859-1 is prohibited [WHATWG: 12.2.3.3, W3C: 8.2.2.3].
I ran across some code once for descrambling data that had been incorrectly processed like that, which I found common in legal documents. It's an interesting problem, because strictly speaking, it's lossy, but you can use probabilities to figure out something plausible. You can decode/encode one thing as another, or you can decode/encode multiple times...
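The single-pass version of that descrambling is short in Python. This is only a sketch assuming exactly one Latin-1 mis-decode; the lossy, multi-pass, probabilistic case the comment describes is much harder:

```python
# Round-trip repair of classic UTF-8-read-as-Latin-1 mojibake.
# Sketch only: real documents may be mis-decoded multiple times or
# through Windows-1252, which this simple version doesn't handle.
def fix_mojibake(text: str) -> str:
    try:
        # Re-encode the garbled string back to the bytes it came from,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not repairable this way; leave the text alone.
        return text

print(fix_mojibake("LÃ©on"))  # Léon
```

For the harder cases, the Python library ftfy does this kind of repair with heuristics for repeated and mixed mis-decodings.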
Any chance you have a link? I've had to implement solutions to this myself and it's very tedious. If someone has built a more complete solution, I would love to just use that instead.
That might be what I'm remembering; then again, I don't really do Python, so maybe it was something else. I doubt it was anything better than the link above, regardless.
You could try inputting your name as [Latin Small Letter E][Combining Acute Accent]:
e◌́
=>
é
Which should keep the `e` intact, while the combining acute accent (0xCC 0x81) may "only" get converted to an `Ì`, which may then be stripped. 0x81 is undefined in Windows-1252, so I have no idea what would happen to it, but it would probably be stripped as well, leaving just 'Leon'.
Unless someone decides to NFC-normalize the text along the way. And it's generally agreed that text should be normalized with NFC, although there is often a fierce debate about who should do it ("not me").
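Both behaviors are easy to check in Python: the raw UTF-8 bytes of the combining sequence (the 0xCC 0x81 mentioned above), and the single precomposed code point that NFC normalization turns it into:

```python
import unicodedata

# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT
s = "e\u0301"

# The combining accent encodes to the bytes 0xCC 0x81 in UTF-8.
print(s.encode("utf-8"))                 # b'e\xcc\x81'

# NFC normalization collapses the pair into precomposed U+00E9.
print(unicodedata.normalize("NFC", s))   # é
```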
Reminds me of the times when Amazon failed to reproduce the ü in my last name on their shipping labels. They consistently printed the UTF-8-encoded character interpreted as an 8-bit ASCII sequence. That bug was present for a couple of years.
The letter e with an acute accent causes all sorts of UTF-8 encoding issues with many services, not just airlines. If you interpret the UTF-8 bytes for é (0xC3 0xA9) as Latin-1, they become Ã (0xC3) + © (0xA9), so my name often comes out as 'Léon'.
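That exact transformation can be reproduced in a couple of lines of Python, assuming a straight Latin-1 interpretation of the UTF-8 bytes:

```python
# Decode UTF-8 bytes as Latin-1 to reproduce the mojibake above.
name = "Léon"
garbled = name.encode("utf-8").decode("latin-1")
print(garbled)  # LÃ©on
```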