
I have some Unicode string truncation code at work. It just mindlessly chops off any code points that won't fit in N bytes. No worrying about grammar, combining characters, multi-code-point emoji, etc.

This is because the output doesn't have to be perfect, but it does absolutely positively have to have bounded length or various databases start getting real grumpy.
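For illustration, here is a minimal sketch of that kind of byte-bounded truncation in Python. The name `truncate_utf8` and the choice of UTF-8 as the storage encoding are assumptions, not the actual code from work. It only guarantees that the result is valid UTF-8 and at most `max_bytes` long; it makes no attempt to keep combining sequences or emoji sequences intact.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of `text` whose UTF-8 encoding fits in `max_bytes`.

    Hypothetical helper: cuts only on code point boundaries, nothing smarter.
    """
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")

    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text

    # UTF-8 continuation bytes look like 0b10xxxxxx. If the byte just past
    # the cut is a continuation byte, the cut falls inside a code point,
    # so back up until the cut sits at the start of a code point.
    cut = max_bytes
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1

    return encoded[:cut].decode("utf-8")
```

Something like `truncate_utf8(value, 255)` before an insert keeps the column happy, at the cost of possibly dropping a trailing accent or half of an emoji sequence.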



If you're chopping a diacritic off, you're changing meaning. If you're chopping an emoji off with a dangling ZWJ, you've potentially got an invalid character. Depending on the language and text, you might be completely changing the meaning of what you're storing.

Your database might be grumpy otherwise, but that doesn't make arbitrary truncation correct. This is an issue with your schema; it doesn't mean truncation is the best solution.
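To make the two failure modes concrete, here is a small Python sketch. The helper names `chop`, `cuts_before_combining_mark`, and `ends_with_dangling_zwj` are made up for this example, and `chop` is just a stand-in for any truncation that cuts on code point boundaries. Note that the truncated output is still valid UTF-8 in both cases; what breaks is the combining sequence or the emoji sequence, not the encoding.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER, glues emoji into a single visible glyph


def chop(text: str, max_bytes: int) -> str:
    """Stand-in truncation: keep whole code points that fit in max_bytes."""
    # errors="ignore" drops an incomplete trailing UTF-8 sequence.
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")


def cuts_before_combining_mark(original: str, truncated: str) -> bool:
    """True if the first dropped code point was a combining mark,
    i.e. the last kept character lost its diacritic."""
    if len(truncated) >= len(original):
        return False
    return unicodedata.combining(original[len(truncated)]) != 0


def ends_with_dangling_zwj(truncated: str) -> bool:
    """True if the text ends in a joiner with nothing left to join."""
    return truncated.endswith(ZWJ)


# "résumé" written with combining acute accents (e + U+0301).
word = "re\u0301sume\u0301"
# WOMAN + ZWJ + GIRL, rendered as one family emoji where supported.
family = "\U0001F469" + ZWJ + "\U0001F467"

short_word = chop(word, 2)      # keeps "re", drops the accent on the "e"
short_family = chop(family, 8)  # keeps WOMAN + ZWJ, drops GIRL
```

Here `cuts_before_combining_mark(word, short_word)` and `ends_with_dangling_zwj(short_family)` are both true. Checks like these could be used to back the cut up a little further, though doing it properly means segmenting on grapheme clusters, which the Python standard library does not provide.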



