
It does not. Your code is using precomposed instead of decomposed chars.

Try:

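    fn main() {
        // "noël" in decomposed form: 'e' followed by U+0308 COMBINING DIAERESIS.
        let s = String::from("noe\u{0308}l");
        // Reversing by code point moves the diaeresis onto the 'l'.
        println!("Reversible? {}", s.chars().rev().collect::<String>());
        println!(
            "First three characters? {}",
            // Taking three code points cuts the string before the combining mark.
            s.chars().take(3).collect::<String>()
        );
    }

With that you get l̈eon, which (according to the article) is wrong; likewise the first three chars come out as "noe", dropping the diacritic.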



Although it's not in the standard library, this can be handled properly using the unicode-segmentation crate: https://crates.io/crates/unicode-segmentation


It's a bit disappointing that a relatively new language like Rust doesn't handle these correctly.

There's no value in a String type if it doesn't behave correctly for text. One can simply use an "Array<char>" to convey the proper semantic meaning.


Can you explain more?

I updated my post with more info and a link to a code repo.

The code correctly shows me the reverse "lëon" with the umlaut, and the first three characters "noë" with the umlaut.


ë may be represented in two ways:

1. One code point: U+00EB. This is the "precomposed" form.

2. Two code points: U+0065 U+0308, aka e followed by ¨. This is the "decomposed" form, also known as a "combining sequence" since the diaeresis combines with the base character.
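To make the difference concrete, here is a minimal std-only sketch. The helper name code_points_and_bytes is made up for illustration. It counts code points and UTF-8 bytes for each form and shows that a plain comparison treats them as different strings:

```rust
/// Hypothetical helper: returns (number of code points, number of UTF-8 bytes).
fn code_points_and_bytes(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}

fn main() {
    let precomposed = "\u{00EB}"; // ë as a single code point
    let decomposed = "e\u{0308}"; // 'e' followed by COMBINING DIAERESIS

    println!("precomposed: {:?}", code_points_and_bytes(precomposed));
    println!("decomposed:  {:?}", code_points_and_bytes(decomposed));

    // String comparison is byte-wise, so the two forms are not equal
    // even though they usually render identically.
    println!("equal? {}", precomposed == decomposed);

    // Dump the code points of the decomposed form.
    for c in decomposed.chars() {
        println!("U+{:04X}", c as u32);
    }
}
```

This is also a way to tell which form you have without reaching for a hex editor.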

If your string type is a sequence of code points, then reversing a decomposed string will tear the combining sequence, and apply the diaeresis to the wrong character. Most string types are affected by this (or worse), Rust included.
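A rough std-only sketch of how to avoid the tearing: keep each base character together with the combining marks that follow it, and reverse those clusters rather than individual code points. The function names are made up for illustration, and the mark check only covers the Combining Diacritical Marks block (U+0300 to U+036F). This is a deliberate simplification, not full Unicode grapheme segmentation.

```rust
/// Assumption: only U+0300..=U+036F counts as a combining mark here.
/// Real grapheme segmentation covers far more than this block.
fn is_combining_mark(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Splits a string into clusters: a base char plus any combining marks after it.
fn clusters(s: &str) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for c in s.chars() {
        if is_combining_mark(c) {
            if let Some(last) = out.last_mut() {
                // Attach the mark to the cluster it modifies.
                last.push(c);
                continue;
            }
        }
        // A base character (or a leading stray mark) starts a new cluster.
        out.push(c.to_string());
    }
    out
}

/// Reverses the order of clusters, keeping each cluster intact.
fn reverse_clusters(s: &str) -> String {
    clusters(s).into_iter().rev().collect()
}

/// Takes the first `n` clusters instead of the first `n` code points.
fn take_clusters(s: &str, n: usize) -> String {
    clusters(s).into_iter().take(n).collect()
}

fn main() {
    let s = "noe\u{0308}l";
    println!("naive reverse:   {}", s.chars().rev().collect::<String>());
    println!("cluster reverse: {}", reverse_clusters(s));
    println!("first three:     {}", take_clusters(s, 3));
}
```

For anything beyond a demo, a proper grapheme cluster implementation is the safer choice, since emoji sequences, Hangul jamo and many other cases fall outside this simple range check.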

The two forms get rendered identically, so you more or less need a hex editor to figure out which form you've got. I forked your repo and switched it to decomposed (the diff looks like a noop), and now it produces the wrong output:

https://github.com/ridiculousfish/demo-rust-string-issues

Another Rust example using explicit Unicode literals: https://play.rust-lang.org/?version=stable&mode=debug&editio...

One could reasonably conclude that precomposed forms are just better and easier. But they're considered legacy: we can't encode every possible combining sequence into a code point, so we might as well go the other way and decompose whenever possible. That's what Normalization Form D is about.


Much obliged, thank you. I'm adding info about this to the repo, including examples of precomposed versus decomposed characters. Now I do see the problem that you and the article author are describing.



