It does not. Your code is using precomposed instead of decomposed chars. Try: le...

LinAGKar · on June 23, 2021

Although it's not in the standard library, this can be handled properly using the unicode-segmentation crate: https://crates.io/crates/unicode-segmentation
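To give a feel for what grapheme-aware handling does, here is a std-only sketch that reverses a string while keeping combining marks attached to their base character. It is not the unicode-segmentation crate's API, and the helper names `is_combining_mark` and `reverse_keeping_marks` are made up for illustration. It only recognizes the Combining Diacritical Marks block (U+0300 to U+036F), so it is a rough approximation of real grapheme segmentation, which covers far more cases.

```rust
/// Rough check for the Combining Diacritical Marks block (U+0300..=U+036F).
/// Assumption: this is a simplification; real grapheme cluster rules
/// (UAX #29) cover many more characters and scripts.
fn is_combining_mark(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Reverses `s` while keeping each base character together with the
/// combining marks that immediately follow it.
fn reverse_keeping_marks(s: &str) -> String {
    let mut clusters: Vec<String> = Vec::new();
    for c in s.chars() {
        if is_combining_mark(c) {
            // Attach the mark to the preceding cluster, if there is one.
            if let Some(last) = clusters.last_mut() {
                last.push(c);
                continue;
            }
        }
        // Otherwise start a new cluster with this character.
        clusters.push(c.to_string());
    }
    clusters.into_iter().rev().collect()
}

fn main() {
    // "noël" with a decomposed ë: e followed by U+0308.
    let decomposed = "noe\u{0308}l";
    // The diaeresis stays on the e after reversal.
    println!("{}", reverse_keeping_marks(decomposed));
}
```

For anything beyond a demo, a proper segmentation library is the safer choice, since this sketch ignores emoji sequences, Hangul jamo, and other cluster types.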

mortoray · on June 23, 2021

It's a bit disappointing that a relatively new language like Rust doesn't handle these correctly.

There's no value in a String type if it doesn't behave correctly for text. One can simply use an "Array<char>" to convey the proper semantic meaning.

jph · on June 23, 2021

Can you explain more?

I updated my post with more info and a link to a code repo.

The code correctly shows me the reversed string "lëon" with the umlaut, and the first three characters "noë" with the umlaut.

ridiculous_fish · on June 23, 2021

ë may be represented in two ways:

1. One code point: U+00EB. This is the "precomposed" form.

2. Two code points: U+0065 U+0308, aka e followed by ¨. This is the "decomposed" form, also known as a "combining sequence" since the diaeresis combines with the base character.
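A small sketch of the two forms using explicit escapes, so the difference is visible without a hex editor. The helper `code_points` is a name made up for this example; everything else is standard library.

```rust
/// Returns the code points of `s` formatted as "U+XXXX" strings.
/// (`code_points` is a hypothetical helper for this illustration.)
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    let precomposed = "\u{00EB}"; // one code point
    let decomposed = "e\u{0308}"; // e followed by combining diaeresis

    println!("{:?}", code_points(precomposed));
    println!("{:?}", code_points(decomposed));

    // The two render the same, but they are different strings to Rust:
    // comparison is over the encoded bytes, with no normalization applied.
    println!("equal: {}", precomposed == decomposed);
    println!("char counts: {} vs {}",
        precomposed.chars().count(),
        decomposed.chars().count());
}
```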

If your string type is a sequence of code points, then reversing a decomposed string will tear the combining sequence, and apply the diaeresis to the wrong character. Most string types are affected by this (or worse), Rust included.
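A minimal sketch of that tearing, assuming reversal is done per code point with `chars().rev()`. The function name `reverse_by_code_point` is made up for this example.

```rust
/// Reverses a string one code point at a time.
/// (`reverse_by_code_point` is a hypothetical helper for this illustration.)
fn reverse_by_code_point(s: &str) -> String {
    s.chars().rev().collect()
}

fn main() {
    let precomposed = "no\u{00EB}l";  // ë as a single code point
    let decomposed = "noe\u{0308}l";  // ë as e + combining diaeresis

    // Precomposed: the ë survives, because it is a single code point.
    println!("{}", reverse_by_code_point(precomposed));

    // Decomposed: the diaeresis now follows the l instead of the e,
    // so it is rendered on the wrong letter.
    println!("{}", reverse_by_code_point(decomposed));
}
```

In the decomposed case the result is the code point sequence l, U+0308, e, o, n, which is why the mark shows up on the l.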

The two forms get rendered identically, so you more or less need a hex editor to figure out which form you've got. I forked your repo and switched it to decomposed (the diff looks like a noop), and now it produces the wrong output:

https://github.com/ridiculousfish/demo-rust-string-issues

Another Rust example using explicit Unicode literals: https://play.rust-lang.org/?version=stable&mode=debug&editio...

One could reasonably conclude that precomposed forms are just better and easier. But they're considered legacy: we can't encode every possible combining sequence into a code point, so we might as well go the other way and decompose whenever possible. That's what Normalization Form D is about.

jph · on June 23, 2021

Much obliged, thank you. I'm adding info about this to the repo, including an explanation of precomposed versus decomposed characters. Now I do see the problem that you and the article author are describing.