The author seems to be under the impression that utf8 and Unicode are distinct, ...

dcode · on Nov 30, 2022

That's not the issue / impression. The issue is that well-formed UTF-16 is rarely used in practice. All of JavaScript, Java, C#, Dart, Kotlin etc. effectively use WTF-16 for compatibility and performance reasons, and that's what's semantically distinct from UTF-8. These have asymmetric value spaces, so that strings in "legacy" languages would sometimes throw or implicitly mutate synchronously. Mixed systems typically use WTF-8 as the common denominator for this reason, i.e. not UTF-8, but Wasm decided against.

the_mitsuhiko · on Nov 30, 2022

I'm sympathetic to the argument but I'm not convinced that the UTF-16 string format in those languages is much of a hurdle.

Even Python which adopted the pretty ridiculous internal UCS4 encoding is now carrying around a utf-8 pre-encoded version of strings for crossing boundaries. UTF-8 is just too widespread that many languages can avoid having to support it natively in some form.

Likewise WASM is not the first standard that has opinions about string encodings that are not native to a language. For instance Go and Rust which prefer to use UTF-8 internally have to re-encode on the way to Windows APIs (usually!). Likewise Cocoa/Objective-C traditionally use UCS-2 strings which are quite leaky, yet Swift nowadays uses UTF-8 internally and transcodes.

dcode · on Nov 30, 2022

To me it's not so much a question of what's the best / recommended (for new languages) / most used encoding. It's rather the observation that there are so many popular languages operating on Unicode Code Points, not Unicode Scalar Values, and Wasm said it wants to support these (as) well, incl. integration with JS/Web APIs which already share their semantics. And for all of these, the restriction is unnecessary and trivially avoidable by allowing them to pass strings to each other, or from/to JavaScript, without errors or mutation, while not changing anything for those preferring UTF-8. Looking at this the other way around might be valuable as well: What if one would design a Component Model for any of the affected languages? There, introducing an unnecessary discontinuity would be a design mistake I'd argue.

the_mitsuhiko · on Nov 30, 2022

I'm not sure I understand that point. JavaScript has one way to encode strings (WTF-16), many things that compile to WASM have another (UTF-8). The only thing that I see happening is that people generally are starting to recommend the use of UTF-8 as internal string encoding, and particularly across bridges.

The question is IMO not really about if someone operates on unicode code points or not, but what encoding strings shared across a WASM bridge should have and coming to one makes sense.

dcode · on Nov 30, 2022

In more technical jargon, it is about value spaces. All UTF-8 strings map to WTF-16 strings semantically (lists of Unicode Scalar Values are a subset of lists of Unicode Code Points), but some WTF-16 strings do not map to UTF-8 (Unicode Scalar Values exclude some Code Points). That's something UTF-8-based languages have to deal with anyway (on the Web, which is WTF-16, or any other mixed system), but it's odd that the expectation has become that even languages that map to each other incl. to JS/Web APIs now have to share a problem that does not exist in their native VMs for reasons. To emphasize: These languages just do what's perfectly fine in WebIDL, JSON, ECMAScript and their own language specifications.

  let myString = "...";
  let returnedMyString = someFunction(myString); // might or might not cross boundary
  if (myString == returnedMyString) {
    // sometimes silently false
  }

I am having a hard time to understand how so many people prefer this as the outcome of a Web / polyglot / language neutral standard, especially since affected languages cannot change, and the problem is so trivially avoided, say with a boolean flag "this end is WTF-16, it's OK if the other end is as well" (otherwise use well-formed/UTF-8 semantics).

For context, I once gave a presentation about the pitfalls: https://www.youtube.com/watch?v=Ri2NMnSQo4o

the_mitsuhiko · on Nov 30, 2022

> I am having a hard time to understand how so many people prefer this as the outcome of a Web / polyglot / language neutral standard, especially since affected languages cannot change, and the problem is so trivially avoided, say with a boolean flag "this end is WTF-16, it's OK if the other end is as well" (otherwise use well-formed/UTF-8 semantics).

It's because the idea that languages "cannot change" does not appear to be true. UTF-8 is so widespread now that for languages changing the native string representation towards it has become an interesting proposition. Many modern languages (eg: Go and Rust) already picked UTF-8, others such as Swift changed over to it. Then there are implementations of languages like Python (PyPy) that changed their internal encoding even though that was a widespread assumption that it cannot work.

The web is also not WTF-16, JavaScript is and the web consists of more than just that. WTF-16 to WTF-16 is most likely becoming less and less a thing going forward except for legacy interfaces such as W APIs on Windows and even there it appears that UTF-8 on the codepage level is now strongly recommended.

To give you another example: I'm very interested in using AssemblyScript today to do data processing, but that actually is not all that easy because the data I need to process is in UTF-8. Now to use the string class in AssemblyScript I actually have to do a pointless data conversion to WTF-16 and back.

I would be majorly surprised if JavaScript doesn’t adopt UTF-8 at one point as well.

dcode · on Nov 30, 2022

I do understand the desire to switch all languages and systems to one encoding, of course. However, Switching a WTF-16 language to UTF-8 removes previously valid values from strings, then exchanging what's errors/mutation on Component boundaries right now with errors/mutation when using string APIs. Can't be done in a backwards-compatible way, and all these languages have a lot of existing code. If backwards compatibility is a goal (say when using a breadcrumbs mechanism as in Swift), one still ends up with WTF-8 underneath, which maps to WTF-16, but is not UTF-8. Hence why I think it's impossible, because the only way to pull this off is by replacing affected string APIs (and/or accepting that old APIs then throw or mutate). Likewise, I see a possible future where JS adopts breadcrumbs, but then with WTF-8 (and perhaps a well-formedness flag), not guaranteed UTF-8. In your use case, that would yield a fast-path if a string is well-formed, but still with the same old fallback. Plus, of course, that having a systems fast-path implies that there is a corresponding JS-interop slow-path (when using AS).

the_mitsuhiko · on Nov 30, 2022

PyPy uses utf-8 internally and it’s completely hidden from the user. That’s however possible because in Python there was always a UCS2/USC4 leak to the user code so you could never really rely on anything.

I expect other languages to make the switch sooner or later.

I do think though that this is not all that interesting for the issue here. WASI needs to pick some format and picking UTF-8 is fine. Roundtripping half broken UTF-16 is something that does not need preserving.

I think enforcing UTF-8 there won’t be much of an issue in practice.

dcode · on Nov 30, 2022

I guess we are about to find out whether there is substance to the precedents. My bet is on "what can go wrong, will go wrong", even more so on Web scale. Let's hope I'm wrong.

Jasper_ · on Dec 1, 2022

Yep. See my comments on this the last time this came up: https://news.ycombinator.com/item?id=32767673