By this argument UTF8 can't exist. And yet here it is. PS: I never said 100% for...

masklinn · on Aug 18, 2021

What are you talking about? UTF8 is a single well-defined specification, and detecting that data is definitely not UTF8 is trivial.
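
To illustrate the "trivial to detect" point, here is a minimal sketch in Python. The helper name `is_valid_utf8` is made up for this example; it relies only on the strict UTF-8 decoder in the standard library, which rejects stray continuation bytes, truncated sequences, and overlong encodings.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


ascii_bytes = b"hello"
utf8_bytes = "h\u00e9llo".encode("utf-8")     # e-acute as two bytes: C3 A9
latin1_bytes = "h\u00e9llo".encode("latin-1")  # e-acute as one byte: E9

print(is_valid_utf8(ascii_bytes))   # plain ASCII is always valid UTF-8
print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))  # E9 starts a sequence that never completes
```

Note the asymmetry: a failed decode proves the data is not UTF-8, while a successful decode only shows it could be.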

Twisell · on Aug 18, 2021

And yet it is forward- and backward-compatible with ASCII, and non-blocking against all its ill-defined variants.

Macha · on Aug 18, 2021

ASCII was well defined; CSV was not. So the UTF-8 designers could take the highest bit, which they knew was unused per the ASCII spec, and use it to encode the extra UTF-8 information.
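
A rough sketch of the high-bit point, in Python. The helper `high_bit_set` and the sample strings are made up for illustration: every ASCII byte has its top bit clear, and every byte of a multi-byte UTF-8 sequence has it set, so the two never collide.

```python
def high_bit_set(byte: int) -> bool:
    """True if the most significant bit of an 8-bit value is 1."""
    return byte & 0x80 != 0


ascii_text = "name,price".encode("ascii")
# e-acute, euro sign, and an emoji: 2-, 3-, and 4-byte UTF-8 sequences
non_ascii = "\u00e9\u20ac\U0001F600".encode("utf-8")

# ASCII only ever uses the low 7 bits.
print(any(high_bit_set(b) for b in ascii_text))

# Lead bytes and continuation bytes alike have the high bit set.
print(all(high_bit_set(b) for b in non_ascii))
print([f"{b:08b}" for b in non_ascii])
```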

Also, UTF-8/ASCII compatibility is unidirectional. A tool that understands only ASCII is going to print nonsense when it encounters emoji or whatever in UTF-8. Even the idea that ASCII-only tools won't mangle UTF-8 is limited: sure, dumb passthroughs are fine, but if the tool manipulates the text at all, you're out of luck. What does it mean to uppercase the first byte of a flag emoji?
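
A small Python sketch of what that looks like in practice, using the French flag (two regional-indicator code points) as an assumed example. Python's `bytes.upper()` stands in for an ASCII-only tool here, since it only touches the bytes for `a` to `z`.

```python
# Regional indicators F + R; rendered as a single flag by most fonts.
flag = "\U0001F1EB\U0001F1F7".encode("utf-8")


def ascii_can_read(data: bytes) -> bool:
    """True if a strict ASCII decoder accepts the data."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


def is_utf8(data: bytes) -> bool:
    """True if a strict UTF-8 decoder accepts the data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(len(flag))             # 8 bytes for one visible glyph
print(ascii_can_read(flag))  # an ASCII-only reader cannot interpret it

# ASCII-only uppercasing is a no-op on these bytes: the passthrough case.
print(flag.upper() == flag)

# Any byte-level manipulation, such as keeping "the first byte", breaks it.
first_byte = flag[:1]
print(is_utf8(first_byte))
```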

tsimionescu · on Aug 18, 2021

To be fair, there is basically no way to manipulate arbitrary text at all without mangling it, UTF-8-aware or not. What does it mean to take the first 7 characters of a UTF-8 string that might contain combining characters and left-to-right marks? What if the text uses special shaping characters, such as those that arrange hieroglyphs in cartouches? You basically need a text-rendering-aware library to manipulate arbitrary strings.
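
The "first 7 characters" problem can be sketched in Python with a decomposed spelling of "résumé". The helper `approx_clusters` is a hypothetical, deliberately crude approximation: it only glues combining marks onto the preceding character and does not handle joiners, variation selectors, directional marks, or the full Unicode segmentation rules.

```python
import unicodedata


def approx_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Crude approximation only, not real grapheme cluster segmentation.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the previous base
        else:
            clusters.append(ch)
    return clusters


# "resume" with each accent stored as a separate combining code point
text = "re\u0301sume\u0301"

print(len(text))                   # 8 code points
print(len(approx_clusters(text)))  # 6 user-perceived characters

# Slicing by code point silently drops the final accent.
naive = text[:7]
print(naive == "re\u0301sume")

# Slicing by cluster keeps each accent with its base letter.
safer = "".join(approx_clusters(text)[:5])
print(safer == "re\u0301sum")
```

Even this only covers one narrow case, which is rather the point of the comment: correct truncation in general needs a library that implements the Unicode segmentation rules.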