
In NodeJS for example, don't you have to use Buffers and special decoders to deal with UTF-8 strings? I.e. it's a pain there too.


I don't think that's a pain. It makes explicit what should be explicit, and the decoded string doesn't have an encoding attached (like in Ruby): it can't be in an unexpected format, it's always UTF-16. One can argue about whether UTF-16 is the best choice, but at least it's always that and always Unicode. No surprises.
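The explicit step is just one call in each direction; roughly something like this (Node REPL, output from memory):

    > Buffer.from('héllo', 'utf8')
    <Buffer 68 c3 a9 6c 6c 6f>
    > Buffer.from([0x68, 0xc3, 0xa9, 0x6c, 0x6c, 0x6f]).toString('utf8')
    'héllo'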


No, JS strings are UTF-8:

    > '蛋糕'.substr(0,1)
    '蛋'
    > '蛋糕'.length
    2
    > Buffer.byteLength('蛋糕')
    6
You do have to be careful when working with binary data (e.g. streams), but this is expected.


They're UTF-16, and substr(), length, etc. work at the code unit level. Hence, the above isn't actually valid for all strings: any characters represented by code points between U+10000 and U+10FFFF require two code units [1]. For example, U+10429 Deseret Small Letter Long E [2]:

  > '𐐩'.substr(0, 1)
  '\ud801'
  > '𐐩'.length
  2
[1] https://en.wikipedia.org/wiki/UTF-16#Description

[2] https://codepoints.net/U+10429
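
If you want to work at the code point level, codePointAt() will give you the whole character where charCodeAt() gives you just the leading surrogate; roughly (Node REPL, from memory):

  > '𐐩'.charCodeAt(0).toString(16)
  'd801'
  > '𐐩'.codePointAt(0).toString(16)
  '10429'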


TIL, thanks :) Interestingly, "for of" iteration works on the whole character, so there must be some magic going on under the hood.
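
The "magic" is presumably just that the string iterator walks code points rather than code units, so spread does the same thing; roughly (Node REPL, from memory):

  > [...'𐐩x']
  [ '𐐩', 'x' ]
  > for (const c of '𐐩x') console.log(c)
  𐐩
  x
  undefined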


And with that you're completely wrong, since strings in JavaScript are UTF-16.

It just so happens that your example consists of two characters that each fit in a single UTF-16 code unit.

(Node.js' Buffer uses UTF-8 by default).


One ambiguity here might be that JavaScript defines strings as UTF-16, but JSON defines strings as UTF-8.
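
That mostly only matters at the point where the JSON text is turned into bytes; in Node it's something like (output from memory):

  > JSON.stringify('𐐩')
  '"𐐩"'
  > Buffer.byteLength(JSON.stringify('𐐩'), 'utf8')
  6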


The 蛋糕 is a lie!



