
> Actually the native character encoding in Windows is UTF-16 so you can't just assume UTF-8. If you're writing low level code, you have to convert UTF-8 to UTF-16 and back.

Yes. You should convert your strings. Thankfully, UTF-16 is very difficult to confuse with UTF-8 because they're completely incompatible encodings. Conversion is (or should be) a relatively simple process in basically any modern language or environment. And personally, I've never run into a problem where the difference between NFC and NFD mattered. (Do you have an example?) The different forms are (or should be) visually completely identical for the user - at least on modern computers with decent Unicode fonts.

The largest problem with UTF-8 (and its biggest strength) is how similar it is to ASCII. It is for this reason we should consider emoji to be a wonderful gift to software correctness everywhere. Correctly handling emoji requires that your software can handle unicode correctly - because they need multi-unit encoding with both UTF-16 and UTF-8. And emoji won't render correctly unless your software can also handle grapheme clusters.

> When people living in ASCII world casually said "I just assume UTF-8", in reality, you still assume it's ASCII.

Check! If your application deals with text, throw your favorite multi-codepoint emoji into your unit testing data. (Mine is the polar bear). Users love emoji, and your software should handle it correctly. There's no excuse! Even the Windows filesystem passes this test today.
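
For example, a minimal test fixture along these lines (JavaScript; the particular strings and names are just my own picks):

    // Hypothetical fixture strings mixing ASCII with multi-codepoint graphemes.
    const samples = [
      "hello",      // plain ASCII
      "🐻‍❄️",         // polar bear: bear + ZWJ + snowflake + variation selector-16
      "👨‍👩‍👧‍👦",   // family: four emoji joined by zero-width joiners
      "🇳🇿",          // flag: two regional indicator codepoints
    ];
    for (const s of samples) {
      // Anything that slices, measures or round-trips strings should survive these.
      console.assert(new TextDecoder().decode(new TextEncoder().encode(s)) === s);
    }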



This.

My native language uses some additional CJK characters on plane 2, and before the ~2010s a lot of software had glitches with anything beyond the basic plane of Unicode. I am forever grateful to the "Gen Z" who pushed for emoji.

JavaScript's String.length is still semantically broken though. Too bad it's part of an unchangeable spec...


There's no definition of String.length that would be the obvious right choice. It depends on the use case. So probably better to let the application provide its own implementation.


> So probably better to let the application provide its own implementation.

I’d be very happy with the standard library providing multiple “length” functions for strings. Generally I want three:

- Length in bytes of the UTF-8 encoded form. E.g. useful for HTTP's Content-Length field.

- Number of Unicode codepoints in the text. This is useful for cursor positions, CRDT work, and some other stuff.

- Number of grapheme clusters in the text when displayed.

These should all be reasonably easy to query. But they're all different functions. They just so happen to return the same result on (most) ASCII text. (I'm not sure how many grapheme clusters \0 or a bell character count as.)

JavaScript's string.length is particularly useless because it isn't even any of the above methods. It returns the number of bytes needed to encode the string as UTF-16, divided by 2. I've never wanted to know that. It's a totally useless measure. Deceptively useless, because it's right there and it works fine so long as your strings only ever contain ASCII. Last I checked, C# and Java strings have the same bug.
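
To illustrate, a rough sketch of all three (plus the default) in JavaScript; Intl.Segmenter needs a reasonably recent runtime, and the variable names are just illustrative:

    const s = "naïve 🐻‍❄️";

    // 1. Length in bytes of the UTF-8 encoded form (e.g. for Content-Length).
    const utf8Bytes = new TextEncoder().encode(s).length;

    // 2. Number of Unicode codepoints.
    const codepoints = [...s].length;

    // 3. Number of grapheme clusters, as a user would perceive them.
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    const graphemes = [...segmenter.segment(s)].length;

    // And the one you get by default: UTF-16 code units.
    const utf16Units = s.length;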


I don't know about Java, but the C# standard library is exceptionally well designed with respect to variable-width encodings. https://learn.microsoft.com/en-us/dotnet/standard/base-types...

The built-in string.Length property is useless (it returns the number of char values, i.e. UTF-16 code units), and I agree that's a problem, but the solution is also built into the language, unlike in JS.


JS these days also has ways to iterate over codepoints and grapheme clusters. If you iterate over the string, its elements will be single-codepoint strings, on which you can call .codePointAt(0) to get the values. (The major JS engines can allegedly elide the allocations for this.) The codepoint count can be obtained most simply with [...string].length, or more efficiently by looping over the iterator manually.

The Intl.Segmenter API [0] can similarly yield iterable objects with all the grapheme clusters of a string. Also, the TextEncoder [1] and TextDecoder [2] APIs can be used to convert strings to and from UTF-8 byte arrays.

[0] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

[1] https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder

[2] https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder
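
Putting those together, a quick sketch (nothing here beyond what the linked docs cover):

    const s = "🐻‍❄️!";

    // The string iterator yields single-codepoint strings.
    for (const ch of s) {
      console.log(ch.codePointAt(0).toString(16));
    }

    // Grapheme clusters via Intl.Segmenter.
    const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    console.log([...seg.segment(s)].map(g => g.segment)); // ["🐻‍❄️", "!"]

    // Round trip through UTF-8.
    const bytes = new TextEncoder().encode(s);          // Uint8Array of UTF-8 bytes
    console.log(new TextDecoder().decode(bytes) === s); // true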


Don't you want grapheme clusters for cursor positions?

Length in encoded form can be found after encoding by checking the length of the binary content I guess.

I think for historical reasons access to codepoints can be useful, but it's rarely what one wants.


There's Intl.Segmenter now, which does Unicode segmentation to count the number of graphemes, for example: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

Though you're right that I don't know of a built-in way to count Unicode Scalar Values (USVs).
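
Counting them by hand is only a couple of lines, though: stepping the string iterator counts codepoints, which equals the USV count whenever the string is well formed (String.prototype.isWellFormed can check that, where it's available):

    // Counts codepoints by stepping the string iterator.
    const countCodepoints = (s) => {
      let n = 0;
      for (const _ of s) n++;
      return n;
    };
    countCodepoints("🐻‍❄️"); // 4: bear, ZWJ, snowflake, variation selector-16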


JavaScript’s recently added implementation of String[Symbol.iterator] iterates through Unicode characters. So for example, [...str] will split any string into a list of Unicode scalar values.


Yep. I don't use eslint, but if I did I would want a lint against any use of string.length. It's almost never what you want, especially now that JavaScript supports Unicode through [...str].


String.length is fine, since it counts UTF-16 (UCS-2?) code units. For a long time the property was only accidentally useful for telling how many characters were in a string, so people came to think that's what it should do.


> I've never run into a problem where the difference between NFC and NFD mattered. (Do you have an example?)

The main place I've seen it get annoying is searching for some text in some other text: unless you normalize the data you're searching through the same way as you normalize your search string, matches can silently fail.
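
In code it ends up being something like this (a sketch; NFC is an arbitrary choice here, what matters is that both sides agree):

    const canon = (s) => s.normalize("NFC");

    const haystack = "Amélie";       // "é" as one precomposed codepoint (NFC)
    const needle = "Ame\u0301lie";   // "e" + combining acute accent (NFD)

    haystack.includes(needle);                 // false - same text, different codepoints
    canon(haystack).includes(canon(needle));   // true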


After reading the above comment I went looking for Unicode documentation on the different normalisation forms. One point surprised me because I'd never thought about it: they say search should be insensitive to normalisation form - so generally you should normalise all text before running a search.

That’s a great tip - obvious in hindsight but one I’d never considered.


> your software should handle it correctly. There's no excuse!

It is valid for the presentation of compound emoji to fall back to their component parts. You can't expect every platform to have an up to date database of every novel combination. A better test is emoji with color modifiers. Another good one is grandfathered symbols with both a text and emoji presentation, forcing the chosen glyph with the variation selector suffix.
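
For example, with the heart symbol (assuming U+FE0E/U+FE0F, the text and emoji variation selectors):

    // U+2764 HEAVY BLACK HEART has both a text and an emoji presentation.
    const textHeart  = "\u2764\uFE0E"; // variation selector-15 forces the text glyph
    const emojiHeart = "\u2764\uFE0F"; // variation selector-16 forces the emoji glyph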


> You can't expect every platform to have an up to date database of every novel combination.

On modern desktop OSes and smart phones, I do expect my platform to have an up-to-date unicode database & font set. Certainly for something like the unicode polar bear, which was added in 2020. I'll begrudgingly look the other way for terminals, embedded systems and maybe video games... but generally it should just work everywhere.

Server code generally shouldn't interact with unicode grapheme clusters at all. I'm struggling to think of any common, valid reason to use a unicode character database in 'normal' backend server code.

> Another good one is grandfathered symbols with both a text and emoji presentation, forcing the chosen glyph with the variation selector suffix.

I didn't know about that one. I'll have to try it out.


> I'm struggling to think of any common, valid reason to use a unicode character database in 'normal' backend server code.

Case insensitive search
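
Which really does need locale-aware Unicode data; the Turkish dotted/dotless i is the classic example (a sketch):

    "İstanbul".toLowerCase();           // "i̇stanbul" - U+0130 lowercases to "i" + combining dot
    "İstanbul".toLocaleLowerCase("tr"); // "istanbul" under Turkish casing rules

    // Naive lowercasing also misses full case folding:
    "straße".toLowerCase() === "STRASSE".toLowerCase(); // false, though case folding treats them as equal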


Unicode gets used in places that aren't continuously updated. Good luck showing a pirate flag emoji on an embedded device like an infotainment system.


I think being continuously updated should be tied to receiving new external content.

It's fine to have an embedded device that's never updated but also never receives new content - it doesn't matter that a system can't show a new emoji if it never has any content that uses that new emoji.

However, if it is expected to display new and updated content from the internet, then the system itself has to be able to get updated, and actually get updated - there are no acceptable excuses for that. If it's going to pull new content, it must also pull new updates for itself.


As the user/owner of the device, no thanks. It should get code updates if and only if I ask for them, which I probably won't unless you have some compelling reason. For the device owner, pulling new updates by itself is just a built-in backdoor/RCE exploit, and in practice those backdoors are often used maliciously. I'd much rather my devices have no way to update and boot from ROM.


So like a CD player needs some way to get updates? I guess they could send out CDs with updates but approximately nobody would actually do that.


The fact that we have to go as far back as CD players for a decent example illustrates my point - the "CD player" content distribution model is long dead. Effectively nobody sells CD players or devices like them, and effectively nobody distributes digital content on CDs or anything like CDs (CD sales are even below vinyl sales). Almost every currently existing product receives content through a channel where updates could trivially be sent as well.

And if we're talking about how new products should be designed, then the "almost" goes away and they 100% wouldn't receive new content through anything like CDs, the issue transforms from an irrelevant niche (like CDs nowadays) to a nonexistent one.


> Another good one is grandfathered symbols with both a text and emoji presentation, forcing the chosen glyph with the variation selector suffix.

I despise that Unicode retroactively applied default emoji presentation to existing symbols, breaking old text. Who the hell thought that was a good idea?


That's one good thing emoji bring to the software developer's mindset.

Before emoji, if somebody opened a bug report like "Your software doesn't handle UTF-8 correctly. It doesn't handle Japanese.",

the response was "Huh? We don't bother to support Japanese. Go pound sand. Close ticket with wontfix.".

Now it's "Your software doesn't handle UTF-8 correctly. It doesn't handle emoji" and we're like "Oh shit! My software can't handle my beloved emoji!"


Exactly. I was and still am surprised by how fast and how widely emoji were adopted.


Possibly because the major software companies made it work on phones. Users therefore saw it working in many apps and complained when your app failed to do the same.

Google, Apple, IBM and MS also did a lot of localisation, so their code bases already deal with encodings.

It is FOSS Unix software that had the ASCII mindset, probably because C and C++ string types are ASCII/byte-oriented and many programmers want to treat strings as arrays. The macOS and Windows APIs take UTF-encoded strings as their input, not char * (agreed, earlier versions did not, but they have provided the UTF encodings for at least 25 years).


> And personally, I've never run into a problem where the difference between NFC and NFD mattered.

You mean like opening a file by name?
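
HFS+ on older macOS, for instance, stored names in a variant of NFD, while most other filesystems keep whatever bytes they're given, so these two visually identical names can be different files (a sketch in JavaScript):

    const nfc = "café.txt";        // "é" as a single precomposed codepoint
    const nfd = "cafe\u0301.txt";  // "e" + combining acute accent

    nfc === nfd;                   // false
    nfc.normalize("NFD") === nfd;  // true

    // On a filesystem that stores names byte-for-byte, opening nfc and opening nfd
    // refer to two different files that render identically in a directory listing.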


> (Mine is the polar bear).

Mine is the crying emoji.

And after enough failures in breaking the system, the 100 emoji.


Those are OK, but both of those emoji are represented as a single Unicode codepoint. Some bugs (particularly UI bugs) only show up when multiple Unicode characters combine to form a single grapheme cluster. I'd recommend something fancier.

I just tried it in gnome-terminal, and while the crying emoji works fine, the polar bear or a country flag causes weird issues.


Crying emoji but with a different skin color?


That’d work!



