
> Actually the native character encoding in Windows is UTF-16 so you can't just assume UTF-8. If you're writing low level code, you have to convert UTF-8 to UTF-16 and back.

Yes. You should convert your strings. Thankfully, UTF-16 is very difficult to confuse with UTF-8 because they're completely incompatible encodings. Conversion is (or should be) a relatively simple process in basically any modern language or environment. And personally, I've never run into a problem where the difference between NFC and NFD mattered. (Do you have an example?) The different forms are (or should be) visually completely identical for the user - at least on modern computers with decent Unicode fonts.

The largest problem with UTF-8 (and its biggest strength) is how similar it is to ASCII. It is for this reason we should consider emoji to be a wonderful gift to software correctness everywhere. Correctly handling emoji requires that your software can handle unicode correctly - because they need multi-unit encoding with both UTF-16 and UTF-8. And emoji won't render correctly unless your software can also handle grapheme clusters.

> When people living in ASCII world casually said "I just assume UTF-8", in reality, you still assume it's ASCII.

Check! If your application deals with text, throw your favorite multi-codepoint emoji into your unit testing data. (Mine is the polar bear). Users love emoji, and your software should handle it correctly. There's no excuse! Even the Windows filesystem passes this test today.
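
For example, a minimal test fixture along these lines (JavaScript; the particular strings and names are just my own picks):

    // Hypothetical fixture strings mixing ASCII with multi-codepoint graphemes.
    const samples = [
      "hello",      // plain ASCII
      "🐻‍❄️",         // polar bear: bear + ZWJ + snowflake + variation selector-16
      "👨‍👩‍👧‍👦",   // family: four emoji joined by zero-width joiners
      "🇳🇿",          // flag: two regional indicator codepoints
    ];
    for (const s of samples) {
      // Anything that slices, measures or round-trips strings should survive these.
      console.assert(new TextDecoder().decode(new TextEncoder().encode(s)) === s);
    }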



This.

My native language uses some additional CJK characters on plane 2, and before the ~2010s a lot of software had glitches with anything beyond the basic plane of Unicode. I am forever grateful to the "Gen Z" who pushed for emoji.

JavaScript's String.length is still semantically broken though. Too bad it's part of an unchangeable spec...


There's no definition of String.length that would be the obvious right choice. It depends on the use case. So probably better to let the application provide its own implementation.


> So probably better to let the application provide its own implementation.

I’d be very happy with the standard library providing multiple “length” functions for strings. Generally I want three:

- Length in bytes of the UTF-8 encoded form. E.g. useful for HTTP's Content-Length field.

- Number of Unicode codepoints in the text. This is useful for cursor positions, CRDT work, and some other stuff.

- Number of grapheme clusters in the text when displayed.

These should all be reasonably easy to query. But they're all different functions. They just so happen to return the same result on (most) ASCII text. (I'm not sure how many grapheme clusters \0 or a bell character count as.)

JavaScript's string.length is particularly useless because it isn't even any of the above methods. It returns the number of bytes needed to encode the string as UTF-16, divided by 2. I've never wanted to know that. It's a totally useless measure. Deceptively useless, because it's right there and it works fine so long as your strings only ever contain ASCII. Last I checked, C# and Java strings have the same bug.
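
To illustrate, a rough sketch of all three (plus the default) in JavaScript; Intl.Segmenter needs a reasonably recent runtime, and the variable names are just illustrative:

    const s = "naïve 🐻‍❄️";

    // 1. Length in bytes of the UTF-8 encoded form (e.g. for Content-Length).
    const utf8Bytes = new TextEncoder().encode(s).length;

    // 2. Number of Unicode codepoints.
    const codepoints = [...s].length;

    // 3. Number of grapheme clusters, as a user would perceive them.
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    const graphemes = [...segmenter.segment(s)].length;

    // And the one you get by default: UTF-16 code units.
    const utf16Units = s.length;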


I don't know about Java, but the C# standard library is exceptionally well designed with respect to variable-width encodings. https://learn.microsoft.com/en-us/dotnet/standard/base-types...

The built-in string.Length property is useless (it returns the number of char values, i.e. UTF-16 code units), and I agree that's a problem, but the solution is also built into the language, unlike in JS.


JS these days also has ways to iterate over codepoints and grapheme clusters. If you iterate over the string, its elements will be single-codepoint strings, on which you can call .codePointAt(0) to get the values. (The major JS engines can allegedly elide the allocations for this.) The codepoint count can be obtained most simply with [...string].length, or more efficiently by looping over the iterator manually.

The Intl.Segmenter API [0] can similarly yield iterable objects with all the grapheme clusters of a string. Also, the TextEncoder [1] and TextDecoder [2] APIs can be used to convert strings to and from UTF-8 byte arrays.

[0] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

[1] https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder

[2] https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder
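
Putting those together, a quick sketch (nothing here beyond what the linked docs cover):

    const s = "🐻‍❄️!";

    // The string iterator yields single-codepoint strings.
    for (const ch of s) {
      console.log(ch.codePointAt(0).toString(16));
    }

    // Grapheme clusters via Intl.Segmenter.
    const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    console.log([...seg.segment(s)].map(g => g.segment)); // ["🐻‍❄️", "!"]

    // Round trip through UTF-8.
    const bytes = new TextEncoder().encode(s);          // Uint8Array of UTF-8 bytes
    console.log(new TextDecoder().decode(bytes) === s); // true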


Don't you want grapheme clusters for cursor positions?

Length in encoded form can be found after encoding by checking the length of the binary content I guess.

I think for historical reasons access to codepoints can be useful, but it's rarely what one wants.


There's Intl.Segmenter now, which does Unicode segmentation to count the number of graphemes, for example: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

Though you're right that I don't know of a built-in way to count Unicode Scalar Values (USVs).
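
Counting them by hand is only a couple of lines, though: stepping the string iterator counts codepoints, which equals the USV count whenever the string is well formed (String.prototype.isWellFormed can check that, where it's available):

    // Counts codepoints by stepping the string iterator.
    const countCodepoints = (s) => {
      let n = 0;
      for (const _ of s) n++;
      return n;
    };
    countCodepoints("🐻‍❄️"); // 4: bear, ZWJ, snowflake, variation selector-16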


JavaScript’s recently added implementation of String[Symbol.iterator] iterates through Unicode characters. So for example, [...str] will split any string into a list of Unicode scalar values.


Yep. I don't use eslint, but if I did I would want a lint against any use of string.length. It's almost never what you want, especially now that JavaScript supports Unicode through [...str].


String.length is fine, since it counts UTF-16 (UCS-2?) code units. For a long time the property was only accidentally useful for telling how many characters were in a string, so people came to think that's what it should do.


> I've never run into a problem where the difference between NFC and NFD mattered. (Do you have an example?)

The main place I've seen it get annoying is searching for some text in some other text: unless you normalize the data you're searching through the same way as you normalize your search string, matches can silently fail.
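
In code it ends up being something like this (a sketch; NFC is an arbitrary choice here, what matters is that both sides agree):

    const canon = (s) => s.normalize("NFC");

    const haystack = "Amélie";       // "é" as one precomposed codepoint (NFC)
    const needle = "Ame\u0301lie";   // "e" + combining acute accent (NFD)

    haystack.includes(needle);                 // false - same text, different codepoints
    canon(haystack).includes(canon(needle));   // true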


After reading the above comment I went looking for Unicode documentation on the different normalisation forms. One point surprised me because I'd never thought about it: they say search should be insensitive to normalisation form - so generally you should normalise all text before running a search.

That’s a great tip - obvious in hindsight but one I’d never considered.


> your software should handle it correctly. There's no excuse!

It is valid for the presentation of compound emoji to fall back to their component parts. You can't expect every platform to have an up to date database of every novel combination. A better test is emoji with color modifiers. Another good one is grandfathered symbols with both a text and emoji presentation, forcing the chosen glyph with the variation selector suffix.
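
For example, with the heart symbol (assuming U+FE0E/U+FE0F, the text and emoji variation selectors):

    // U+2764 HEAVY BLACK HEART has both a text and an emoji presentation.
    const textHeart  = "\u2764\uFE0E"; // variation selector-15 forces the text glyph
    const emojiHeart = "\u2764\uFE0F"; // variation selector-16 forces the emoji glyph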


> You can't expect every platform to have an up to date database of every novel combination.

On modern desktop OSes and smart phones, I do expect my platform to have an up-to-date unicode database & font set. Certainly for something like the unicode polar bear, which was added in 2020. I'll begrudgingly look the other way for terminals, embedded systems and maybe video games... but generally it should just work everywhere.

Server code generally shouldn't interact with unicode grapheme clusters at all. I'm struggling to think of any common, valid reason to use a unicode character database in 'normal' backend server code.

> Another good one is grandfathered symbols with both a text and emoji presentation, forcing the chosen glyph with the variation selector suffix.

I didn't know about that one. I'll have to try it out.


> I'm struggling to think of any common, valid reason to use a unicode character database in 'normal' backend server code.

Case insensitive search
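
Which really does need locale-aware Unicode data; the Turkish dotted/dotless i is the classic example (a sketch):

    "İstanbul".toLowerCase();           // "i̇stanbul" - U+0130 lowercases to "i" + combining dot
    "İstanbul".toLocaleLowerCase("tr"); // "istanbul" under Turkish casing rules

    // Naive lowercasing also misses full case folding:
    "straße".toLowerCase() === "STRASSE".toLowerCase(); // false, though case folding treats them as equal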


Unicode gets used in places that aren't continuously updated. Good luck showing a pirate flag emoji on an embedded device like an infotainment system.


I think being continuously updated should be tied to receiving new external content.

It's fine to have an embedded device that's never updated but also never receives new content - it doesn't matter that a system can't show a new emoji if it never has any content that uses that new emoji.

However, if it is expected to display new and updated content from the internet, then the system itself has to be able to get updated, and actually get updated - there are no acceptable excuses for that. If it's going to pull new content, it must also pull new updates for itself.


As the user/owner of the device, no thanks. It should get code updates if and only if I ask for them, which I probably won't unless you have some compelling reason. For the device owner, pulling new updates by itself is just a built-in backdoor/RCE exploit, and in practice those backdoors are often used maliciously. I'd much rather my devices have no way to update and boot from ROM.


So like a CD player needs some way to get updates? I guess they could send out CDs with updates but approximately nobody would actually do that.


The fact that we have to go as far back as CD players for a decent example illustrates my point - the "CD player" content distribution model is long dead. Effectively nobody sells CD players or devices like them, and effectively nobody distributes digital content on CDs or anything like CDs (CD sales are even below vinyl sales). Almost every currently existing product receives content through a channel where updates could trivially be sent as well.

And if we're talking about how new products should be designed, then the "almost" goes away and they 100% wouldn't receive new content through anything like CDs, the issue transforms from an irrelevant niche (like CDs nowadays) to a nonexistent one.


> Another good one is grandfathered symbols with both a text and emoji presentation, forcing the chosen glyph with the variation selector suffix.

I despise that Unicode retroactively applied default emoji presentation to existing symbols, breaking old text. Who the hell thought that was a good idea?


That's one good thing emoji bring to the software developer's mindset.

Before emoji, if somebody opened a bug report like "Your software doesn't handle UTF-8 correctly. It doesn't handle Japanese.",

the response was "Huh? We don't bother to support Japanese. Go pound sand. Close ticket with wontfix.".

Now it's "Your software doesn't handle UTF-8 correctly. It doesn't handle emoji" and we're like "Oh shit! My software can't handle my beloved emoji!"


Exactly. I was and still am surprised by how fast and how widely emoji were adopted.


Possibly because the major software companies made it work on phones. Users therefore saw it working in many apps and complained when your app failed to do the same.

Google, Apple, IBM and MS also did a lot of localisation, so their code bases already deal with encodings.

It is FOSS Unix software that had the ASCII mindset, probably because C and C++ string types are ASCII/byte-oriented and many programmers want to treat strings as arrays. The macOS and Windows APIs take UTF-encoded strings as their input, not char * (agreed, earlier versions did not, but they have provided the UTF encodings for at least 25 years).


> And personally, I've never run into a problem where the difference between NFC and NFD mattered.

You mean like opening a file by name?
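
HFS+ on older macOS, for instance, stored names in a variant of NFD, while most other filesystems keep whatever bytes they're given, so these two visually identical names can be different files (a sketch in JavaScript):

    const nfc = "café.txt";        // "é" as a single precomposed codepoint
    const nfd = "cafe\u0301.txt";  // "e" + combining acute accent

    nfc === nfd;                   // false
    nfc.normalize("NFD") === nfd;  // true

    // On a filesystem that stores names byte-for-byte, opening nfc and opening nfd
    // refer to two different files that render identically in a directory listing.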


> (Mine is the polar bear).

Mine is the crying emoji.

And after enough failures in breaking the system, the 100 emoji.


Those are OK, but both of those emoji are represented as a single Unicode codepoint. Some bugs (particularly UI bugs) only show up when multiple Unicode characters combine to form a single grapheme cluster. I'd recommend something fancier.

I just tried it in gnome-terminal, and while the crying emoji works fine, the polar bear or a country flag causes weird issues.


Crying emoji but with a different skin color?


That’d work!



