When I wrote a very primitive UTF-8 library, I really began to appreciate UTF-8'... | Hacker News


ftvy on April 14, 2020 | on: UTF-8 Everywhere

When I wrote a very primitive UTF-8 library, I really began to appreciate UTF-8's design. For example, the first byte says how many bytes the character requires. At first it was daunting, but once I put two and two together, it really opened up. I am sure there are many aspects of UTF-8 I am missing, but its design and implementation all seem reasonable. For reference, I was converting between code points and actual bytes, and I also implemented strlen and strcmp (the latter of which the standard library apparently already handles fine).
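To make the "first byte says how many bytes" point concrete, here is a minimal sketch of the pieces described above: reading the length from the lead byte, converting a code point to bytes, and a code-point-counting strlen. The function names (`utf8_seq_len`, `encode_code_point`, `utf8_strlen`) are made up for illustration and are not from the commenter's library, and Python is used only for brevity.

```python
def utf8_seq_len(lead: int) -> int:
    """Length announced by a lead byte, or 0 if the byte cannot start a sequence."""
    if lead < 0x80:
        return 1                      # 0xxxxxxx: ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2                      # 110xxxxx (0xC0/0xC1 would be overlong)
    if 0xE0 <= lead <= 0xEF:
        return 3                      # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4                      # 11110xxx, capped at U+10FFFF
    return 0                          # continuation byte or invalid lead


def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf8_strlen(data: bytes) -> int:
    """Count code points: every byte that is not 10xxxxxx starts one."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)
```

On strcmp: comparing UTF-8 strings byte by byte (as unsigned bytes) gives the same ordering as comparing them code point by code point, which is presumably why the standard routine works without any UTF-8 awareness.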

TheCoelacanth on April 14, 2020

The self-synchronizing property is also very clever. If you start at an arbitrary byte, you can find the start of the next character by scanning forward a maximum of 3 bytes.
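A rough sketch of that forward resynchronization, assuming the input is valid UTF-8 (the helper name `next_boundary` is hypothetical). Continuation bytes always look like 10xxxxxx, so skipping them lands on the next lead byte; in valid data the loop body runs at most 3 times.

```python
def next_boundary(data: bytes, i: int) -> int:
    """Index of the first byte at or after i that starts a character.

    Returns len(data) if only continuation bytes remain.
    """
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1                        # skip 10xxxxxx continuation bytes
    return i


data = "a\u20acb".encode("utf-8")     # bytes: 61 e2 82 ac 62
start = next_boundary(data, 2)        # index 2 is in the middle of the euro sign
```

Here `start` ends up at index 4, the `b` that follows the three-byte euro sign.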

account42 on April 15, 2020

And scanning backwards works too.
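The backward direction can be sketched the same way, again assuming valid UTF-8 and a hypothetical helper name (`prev_boundary`): step left while the current byte is a continuation byte, and you stop on the lead byte of the character that contains the starting position.

```python
def prev_boundary(data: bytes, i: int) -> int:
    """Index of the lead byte of the character containing position i.

    Assumes 0 <= i < len(data).
    """
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1                        # step back over 10xxxxxx bytes
    return i


data = "a\u20acb".encode("utf-8")     # bytes: 61 e2 82 ac 62
lead = prev_boundary(data, 3)         # index 3 is the last byte of the euro sign
```

Here `lead` is 1, the 0xE2 byte that opens the euro sign.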
