In more technical jargon, it is about value spaces. Every UTF-8 string maps to a WTF-16 string (lists of Unicode Scalar Values are a subset of lists of Unicode Code Points), but some WTF-16 strings do not map to UTF-8, because Unicode Scalar Values exclude the surrogate code points U+D800–U+DFFF. That's something UTF-8-based languages have to deal with anyway (on the Web, which is WTF-16, or in any other mixed system), but it's odd that the expectation has become that even languages whose strings map cleanly to each other, including to JS/Web APIs, now have to share a problem that does not exist in their native VMs. To emphasize: these languages just do what's perfectly fine in WebIDL, JSON, ECMAScript and their own language specifications.
```js
let myString = "...";
let returnedMyString = someFunction(myString); // might or might not cross an encoding boundary
if (myString === returnedMyString) {
  // sometimes silently false
}
```
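To make the failure mode concrete, here is a minimal runnable sketch (mine, not from the snippet above) that simulates the boundary crossing with `TextEncoder`/`TextDecoder`, which sanitize lone surrogates when producing UTF-8:

```js
// A lone (unpaired) high surrogate: a perfectly valid JS/WTF-16 string,
// but one that has no UTF-8 encoding.
const lone = "a\uD800b";

// Crossing a UTF-8 boundary: TextEncoder replaces the lone surrogate
// with U+FFFD (REPLACEMENT CHARACTER) while encoding.
const bytes = new TextDecoder().decode
  ? new TextEncoder().encode(lone)
  : null;
const roundTripped = new TextDecoder().decode(bytes);

console.log(roundTripped);          // "a\uFFFDb"
console.log(lone === roundTripped); // false, silently, from the caller's view
```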
I am having a hard time understanding how so many people prefer this as the outcome of a Web / polyglot / language-neutral standard, especially since the affected languages cannot change, and the problem is so trivially avoided, say with a boolean flag meaning "this end is WTF-16, it's OK if the other end is as well" (otherwise use well-formed/UTF-8 semantics).
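As a hedged sketch of what that flag could mean (hypothetical names, not from any actual spec or proposal): lossless WTF-16 passthrough only when both ends opt in, well-formed UTF-8 semantics otherwise.

```js
// Hypothetical illustration of the proposed flag; none of these names
// come from a real API.
function pickStringSemantics(callerIsWtf16, calleeIsWtf16) {
  // Both ends are WTF-16: lone surrogates can pass through losslessly.
  if (callerIsWtf16 && calleeIsWtf16) return "wtf-16-passthrough";
  // Otherwise fall back to well-formed (UTF-8) string semantics.
  return "well-formed-utf-8";
}
```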
> I am having a hard time understanding how so many people prefer this as the outcome of a Web / polyglot / language-neutral standard, especially since the affected languages cannot change, and the problem is so trivially avoided, say with a boolean flag meaning "this end is WTF-16, it's OK if the other end is as well" (otherwise use well-formed/UTF-8 semantics).
It's because the idea that languages "cannot change" does not appear to be true. UTF-8 is so widespread now that switching a language's native string representation to it has become an interesting proposition. Many modern languages (e.g. Go and Rust) picked UTF-8 from the start, and others, such as Swift, changed over to it. There are also implementations of languages like Python (PyPy) that changed their internal encoding, even though it was widely assumed that this could not work.
The web is also not WTF-16; JavaScript is, and the web consists of more than just that. WTF-16-to-WTF-16 is most likely becoming less and less of a thing going forward, except for legacy interfaces such as the wide-character (`*W`) APIs on Windows, and even there UTF-8 at the codepage level now appears to be strongly recommended.
To give you another example: I'm very interested in using AssemblyScript today to do data processing, but that is actually not all that easy, because the data I need to process is in UTF-8. To use the string class in AssemblyScript, I have to do a pointless data conversion to WTF-16 and back.
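The shape of that round trip, sketched with the Web encoding APIs rather than AssemblyScript's actual standard library:

```js
// Illustrative only: the same conversion shape, shown with Web APIs.
const utf8Input = new Uint8Array([/* UTF-8 data to process */]);

// 1. Decode UTF-8 into a WTF-16 string just to be able to use string APIs.
const str = new TextDecoder().decode(utf8Input);

// 2. ...do the actual string processing on `str`...

// 3. Re-encode to UTF-8 to hand the result back.
const utf8Output = new TextEncoder().encode(str);
```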
I would be majorly surprised if JavaScript doesn't adopt UTF-8 at some point as well.
I do understand the desire to switch all languages and systems to one encoding, of course. However, switching a WTF-16 language to UTF-8 removes previously valid values from its strings, which merely trades today's errors/mutation at Component boundaries for errors/mutation when using string APIs. It can't be done in a backwards-compatible way, and all these languages have a lot of existing code. If backwards compatibility is a goal (say, when using a breadcrumbs mechanism as in Swift), one still ends up with WTF-8 underneath, which maps to WTF-16 but is not UTF-8. Hence I think it's impossible, because the only way to pull this off is by replacing the affected string APIs (and/or accepting that old APIs then throw or mutate). Likewise, I see a possible future where JS adopts breadcrumbs, but then with WTF-8 (and perhaps a well-formedness flag), not guaranteed UTF-8. In your use case, that would yield a fast path when a string is well-formed, but still with the same old fallback. Plus, of course, having a systems fast path implies that there is a corresponding JS-interop slow path (when using AS).
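A minimal sketch of that fast-path/slow-path split, assuming the ES2024 well-formedness methods (the breadcrumbs bookkeeping itself would live inside the engine and is not shown):

```js
// Sketch only: makes the two paths visible. TextEncoder would sanitize
// lone surrogates implicitly anyway; the explicit branch mirrors the
// "well-formedness flag" idea from the paragraph above.
function toUtf8(str) {
  if (str.isWellFormed()) {
    // Fast path: valid UTF-16, so an exact UTF-8 encoding exists.
    return new TextEncoder().encode(str);
  }
  // Slow path (the same old fallback): lone surrogates must be
  // replaced with U+FFFD before the string can leave the WTF-16 world.
  return new TextEncoder().encode(str.toWellFormed());
}
```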
PyPy uses UTF-8 internally, and it's completely hidden from the user. That is possible, however, because in Python the UCS2/UCS4 build difference always leaked into user code, so you could never really rely on anything.
I expect other languages to make the switch sooner or later.
I do think, though, that this is not all that interesting for the issue here. WASI needs to pick some format, and picking UTF-8 is fine. Round-tripping half-broken UTF-16 is not something that needs preserving.
I think enforcing UTF-8 there won’t be much of an issue in practice.
I guess we are about to find out whether there is substance to the precedents. My bet is on "what can go wrong, will go wrong", even more so at Web scale. Let's hope I'm wrong.
For context, I once gave a presentation about the pitfalls: https://www.youtube.com/watch?v=Ri2NMnSQo4o