
When I wrote a very primitive UTF-8 library, I really began to appreciate UTF-8's design. For example, the first byte says how many bytes the character requires. At first it was daunting, but once I put 2 and 2 together, it really opened up.
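To make the "first byte tells you the length" point concrete, here is a minimal sketch in C. The function name utf8_seq_len is made up for illustration, not taken from any real library, and it assumes you only want to classify the lead byte without validating the bytes that follow.

```c
#include <stddef.h>

/* Hypothetical helper: returns the total length in bytes of the UTF-8
   sequence that starts with `lead`, or 0 if `lead` cannot start a
   sequence (a continuation byte or a byte that never appears in UTF-8). */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)                  return 1; /* 0xxxxxxx: ASCII      */
    if (lead >= 0xC2 && lead <= 0xDF) return 2; /* 110xxxxx             */
    if (lead >= 0xE0 && lead <= 0xEF) return 3; /* 1110xxxx             */
    if (lead >= 0xF0 && lead <= 0xF4) return 4; /* 11110xxx, <= U+10FFFF */
    return 0; /* 0x80..0xC1 and 0xF5..0xFF are not valid lead bytes */
}
```

0xC0 and 0xC1 are rejected because they could only start overlong encodings of ASCII, and anything above 0xF4 would encode past U+10FFFF.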

I am sure there are many aspects I am missing about UTF-8, but it is all reasonable in its design and implementation.

For reference, I was converting between code points and actual bytes, and also implemented strlen and strcmp (the latter of which the standard library apparently already handles fine, since comparing valid UTF-8 byte by byte gives the same order as comparing code points).
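Roughly what that looks like, as a sketch rather than my actual code: utf8_encode and utf8_strlen are hypothetical names, and utf8_strlen assumes its input is already valid, NUL-terminated UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point into `out` (room for 4 bytes).
   Returns the number of bytes written, or 0 if `cp` is a surrogate or is
   above U+10FFFF and therefore has no UTF-8 encoding. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* UTF-16 surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: count code points in a NUL-terminated string that
   is assumed to be valid UTF-8. Every byte that is not a continuation
   byte (10xxxxxx) starts exactly one code point. */
size_t utf8_strlen(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

Note that this counts code points, not what a user would call characters; combining marks and the like are a separate problem.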



The self-synchronizing property is also very clever. If you land on an arbitrary byte, you can find the start of the next character by scanning forward past at most 3 continuation bytes.
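A small sketch of that forward scan in C, assuming the buffer holds valid UTF-8; utf8_sync_forward is a name invented here for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: starting at index `i` in `buf` (length `len`),
   return the index of the first byte at or after `i` that is not a
   continuation byte (10xxxxxx). Returns `len` if none is found.
   In valid UTF-8 the loop skips at most 3 bytes. */
size_t utf8_sync_forward(const unsigned char *buf, size_t len, size_t i)
{
    while (i < len && (buf[i] & 0xC0) == 0x80)
        i++;
    return i;
}
```

The whole trick is that continuation bytes always match 10xxxxxx and lead bytes never do, so one mask and compare is enough.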


And scanning backwards works too.
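For the backward direction, a minimal sketch under the same assumption of valid UTF-8 (utf8_sync_backward is a hypothetical name, and the caller must pass an index inside the buffer):

```c
#include <stddef.h>

/* Hypothetical helper: starting at index `i` (which must be a valid index
   into `buf`), step back until the byte is not a continuation byte
   (10xxxxxx). Returns the index of the lead byte of the character that
   contains position `i`, or 0 if the start of the buffer is reached. */
size_t utf8_sync_backward(const unsigned char *buf, size_t i)
{
    while (i > 0 && (buf[i] & 0xC0) == 0x80)
        i--;
    return i;
}
```

This is handy for things like truncating a buffer at a byte limit without cutting a character in half.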



