In more technical jargon, it is about value spaces. Every UTF-8 string maps to a WTF-16 string (lists of Unicode Scalar Values are a subset of lists of Unicode Code Points), but some WTF-16 strings do not map to UTF-8, because Unicode Scalar Values exclude the surrogate code points U+D800–U+DFFF. That's something UTF-8-based languages have to deal with anyway (on the Web, which is WTF-16, or in any other mixed system), but it's odd that the expectation has become that even languages whose strings map cleanly to each other, including to JS/Web APIs, now have to share a problem that does not exist in their native VMs. To emphasize: these languages just do what's perfectly fine in WebIDL, JSON, ECMAScript and their own language specifications.
```js
let myString = "...";
let returnedMyString = someFunction(myString); // might or might not cross an encoding boundary
if (myString === returnedMyString) {
  // sometimes silently false
}
```
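To make the failure mode concrete, here is a minimal runnable sketch (mine, not from the snippet above) that simulates the boundary crossing with `TextEncoder`/`TextDecoder`, which sanitize lone surrogates when producing UTF-8:

```js
// A lone (unpaired) high surrogate: a perfectly valid JS/WTF-16 string,
// but one that has no UTF-8 encoding.
const lone = "a\uD800b";

// Crossing a UTF-8 boundary: TextEncoder replaces the lone surrogate
// with U+FFFD (REPLACEMENT CHARACTER) while encoding.
const bytes = new TextDecoder().decode
  ? new TextEncoder().encode(lone)
  : null;
const roundTripped = new TextDecoder().decode(bytes);

console.log(roundTripped);          // "a\uFFFDb"
console.log(lone === roundTripped); // false, silently, from the caller's view
```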
I am having a hard time understanding how so many people prefer this as the outcome of a Web / polyglot / language-neutral standard, especially since the affected languages cannot change, and the problem is so trivially avoided, say with a boolean flag meaning "this end is WTF-16, it's OK if the other end is as well" (otherwise use well-formed/UTF-8 semantics).
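As a hedged sketch of what that flag could mean (hypothetical names, not from any actual spec or proposal): lossless WTF-16 passthrough only when both ends opt in, well-formed UTF-8 semantics otherwise.

```js
// Hypothetical illustration of the proposed flag; none of these names
// come from a real API.
function pickStringSemantics(callerIsWtf16, calleeIsWtf16) {
  // Both ends are WTF-16: lone surrogates can pass through losslessly.
  if (callerIsWtf16 && calleeIsWtf16) return "wtf-16-passthrough";
  // Otherwise fall back to well-formed (UTF-8) string semantics.
  return "well-formed-utf-8";
}
```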
> I am having a hard time understanding how so many people prefer this as the outcome of a Web / polyglot / language-neutral standard, especially since the affected languages cannot change, and the problem is so trivially avoided, say with a boolean flag meaning "this end is WTF-16, it's OK if the other end is as well" (otherwise use well-formed/UTF-8 semantics).
It's because the idea that languages "cannot change" does not appear to be true. UTF-8 is so widespread now that switching a language's native string representation to it has become an interesting proposition. Many modern languages (e.g. Go and Rust) picked UTF-8 from the start, and others, such as Swift, changed over to it. There are also implementations of languages like Python (PyPy) that changed their internal encoding, even though it was widely assumed that this could not work.
The web is also not WTF-16; JavaScript is, and the web consists of more than just that. WTF-16-to-WTF-16 is most likely becoming less and less of a thing going forward, except for legacy interfaces such as the wide-character (`*W`) APIs on Windows, and even there UTF-8 at the codepage level now appears to be strongly recommended.
To give you another example: I'm very interested in using AssemblyScript today to do data processing, but that is actually not all that easy, because the data I need to process is in UTF-8. To use the string class in AssemblyScript, I have to do a pointless data conversion to WTF-16 and back.
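The shape of that round trip, sketched with the Web encoding APIs rather than AssemblyScript's actual standard library:

```js
// Illustrative only: the same conversion shape, shown with Web APIs.
const utf8Input = new Uint8Array([/* UTF-8 data to process */]);

// 1. Decode UTF-8 into a WTF-16 string just to be able to use string APIs.
const str = new TextDecoder().decode(utf8Input);

// 2. ...do the actual string processing on `str`...

// 3. Re-encode to UTF-8 to hand the result back.
const utf8Output = new TextEncoder().encode(str);
```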
I would be majorly surprised if JavaScript doesn't adopt UTF-8 at some point as well.
I do understand the desire to switch all languages and systems to one encoding, of course. However, switching a WTF-16 language to UTF-8 removes previously valid values from its strings, which merely trades today's errors/mutation at Component boundaries for errors/mutation when using string APIs. It can't be done in a backwards-compatible way, and all these languages have a lot of existing code. If backwards compatibility is a goal (say, when using a breadcrumbs mechanism as in Swift), one still ends up with WTF-8 underneath, which maps to WTF-16 but is not UTF-8. Hence I think it's impossible, because the only way to pull this off is by replacing the affected string APIs (and/or accepting that old APIs then throw or mutate). Likewise, I see a possible future where JS adopts breadcrumbs, but then with WTF-8 (and perhaps a well-formedness flag), not guaranteed UTF-8. In your use case, that would yield a fast path when a string is well-formed, but still with the same old fallback. Plus, of course, having a systems fast path implies that there is a corresponding JS-interop slow path (when using AS).
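A minimal sketch of that fast-path/slow-path split, assuming the ES2024 well-formedness methods (the breadcrumbs bookkeeping itself would live inside the engine and is not shown):

```js
// Sketch only: makes the two paths visible. TextEncoder would sanitize
// lone surrogates implicitly anyway; the explicit branch mirrors the
// "well-formedness flag" idea from the paragraph above.
function toUtf8(str) {
  if (str.isWellFormed()) {
    // Fast path: valid UTF-16, so an exact UTF-8 encoding exists.
    return new TextEncoder().encode(str);
  }
  // Slow path (the same old fallback): lone surrogates must be
  // replaced with U+FFFD before the string can leave the WTF-16 world.
  return new TextEncoder().encode(str.toWellFormed());
}
```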
PyPy uses UTF-8 internally, and it's completely hidden from the user. That is possible, however, because in Python the UCS2/UCS4 build difference always leaked into user code, so you could never really rely on anything.
I expect other languages to make the switch sooner or later.
I do think, though, that this is not all that interesting for the issue here. WASI needs to pick some format, and picking UTF-8 is fine. Round-tripping half-broken UTF-16 is not something that needs preserving.
I think enforcing UTF-8 there won’t be much of an issue in practice.
I guess we are about to find out whether there is substance to the precedents. My bet is on "what can go wrong, will go wrong", even more so at Web scale. Let's hope I'm wrong.
For context, I once gave a presentation about the pitfalls: https://www.youtube.com/watch?v=Ri2NMnSQo4o