Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

By this argument UTF8 can't exist. And yet here it is.

PS: I never said 100% forward/backward compatible with all variant at the same time and without any noticeable artifact. I meant compatible in a non blocking way.



What are you talking about? UTF8 is a single well-defined specification, and detecting that data is definitely not UTF8 is trivial.


And yet it is forward/backward compatible with ASCII and non blocking against all it's ill defined variants.


ASCII was well defined, CSV was not. Therefore they could take the highest bit, which they could know that was unused per the ASCII spec, and use that to encode their extra UTF-8 information.

Also UTF-8/ascii compatibility is unidirectional. A tool that understands ASCII is going to print nonsense when it encounters emoji or whatever in UTF-8. Even the idea that tools that only understand ASCII won't mangle UTF-8 is limited - sure dumb passthroughs are fine, but if it manipulates the text at all, then you're out of luck - what does it mean to uppercase the first byte of a flag emoji?


To be fair, there is basically no way to manipulate arbitrary text at all without mangling it, UTF-8-aware or not. What does it mean to take the first 7 characters of a UTF-8 string which might contain combinator characters and left-to-right special chars? What if the text uses special shaping chars, such as arranging hieroglyphs in cartouches? You basically need a text-rendering aware library to manipulate arbitrary strings.




Consider applying for YC's Winter 2026 batch! Applications are open till Nov 10

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: