
In NodeJS for example, don't you have to use Buffers and special decoders to deal with UTF-8 strings? I.e. it's a pain there too.


I don't think that's a pain. It makes explicit what should be explicit, and the decoded string doesn't have an encoding attached (like in Ruby): it can't be in an unexpected format, it's always UTF-16. One can argue about whether UTF-16 is the best choice, but at least it's always that and always Unicode. No surprises.
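The explicit step is just one call in each direction; roughly something like this (Node REPL, output from memory):

    > Buffer.from('héllo', 'utf8')
    <Buffer 68 c3 a9 6c 6c 6f>
    > Buffer.from([0x68, 0xc3, 0xa9, 0x6c, 0x6c, 0x6f]).toString('utf8')
    'héllo'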


No, JS strings are UTF-8:

    > '蛋糕'.substr(0,1)
    '蛋'
    > '蛋糕'.length
    2
    > Buffer.byteLength('蛋糕')
    6
You do have to be careful when working with binary data (e.g. streams), but this is expected.


They're UTF-16, and substr(), length, etc. work at the code unit level. Hence, the above isn't actually valid for all strings: any characters represented by code points between U+10000 and U+10FFFF require two code units [1]. For example, U+10429 Deseret Small Letter Long E [2]:

  > '𐐩'.substr(0, 1)
  '\ud801'
  > '𐐩'.length
  2
[1] https://en.wikipedia.org/wiki/UTF-16#Description

[2] https://codepoints.net/U+10429
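
If you want to work at the code point level, codePointAt() will give you the whole character where charCodeAt() gives you just the leading surrogate; roughly (Node REPL, from memory):

  > '𐐩'.charCodeAt(0).toString(16)
  'd801'
  > '𐐩'.codePointAt(0).toString(16)
  '10429'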


TIL, thanks :) Interestingly, "for of" iteration works on the whole character, so there must be some magic going on under the hood.
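
The "magic" is presumably just that the string iterator walks code points rather than code units, so spread does the same thing; roughly (Node REPL, from memory):

  > [...'𐐩x']
  [ '𐐩', 'x' ]
  > for (const c of '𐐩x') console.log(c)
  𐐩
  x
  undefined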


And with that you're completely wrong, since strings in JavaScript are UTF-16.

It just so happens that your example consists of two characters that each fit in a single UTF-16 code unit.

(Node.js' Buffer uses UTF-8 by default).


One ambiguity here might be that JavaScript defines strings as UTF-16, but JSON defines strings as UTF-8.
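
That mostly only matters at the point where the JSON text is turned into bytes; in Node it's something like (output from memory):

  > JSON.stringify('𐐩')
  '"𐐩"'
  > Buffer.byteLength(JSON.stringify('𐐩'), 'utf8')
  6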


The 蛋糕 is a lie!



