
Nice work OP.

I’ve done a fair amount of Chinese language segmentation programming - and yeah it’s not easy, especially as you reach for higher levels of accuracy.

You need to put in a significant amount of effort for gains of less than a few percentage points in accuracy.

For my own tools, which focus on speed (and are used for finding frequently used words in large bodies of text), I ended up opting for a greedy longest-match-first algorithm.

It has a relatively high error rate, but it’s acceptable if you’re only looking for the first few hundred frequently used words.
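
As a rough illustration of the longest-match idea described above, here is a minimal sketch. The function name, the `max_len` cap, and the toy dictionary are all assumptions made for the example; they are not taken from the tool mentioned in this comment.

```python
from collections import Counter


def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: at each position, take the
    longest dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocab:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary, purely illustrative.
VOCAB = {"研究", "研究生", "生命", "起源", "的"}

tokens = forward_max_match("研究生命的起源", VOCAB)
# The greedy choice grabs "研究生" and leaves "命" stranded,
# which is the kind of error this approach is known for.
word_counts = Counter(t for t in tokens if len(t) > 1)
```

The mis-split shows why the error rate is relatively high: the algorithm never backtracks. For frequency counting over a large corpus, occasional errors like this tend to matter less than throughput.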

What segmenter are you using, or have you developed your own?




Thanks for the kind words!

I'm using Jieba[0] because it strikes a nice balance between speed and accuracy. But I'm initializing it with a custom dictionary (~800k entries), and have added several layers of heuristic post-processing after segmentation. For example, Jieba tends to split chengyu into two words, but I've decided they should be displayed as a single word, since chengyu are typically a single entry in dictionaries.
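
One way a chengyu-merging pass like this could work is sketched below. It is only a guess at the general shape, not the actual heuristics used in the project. The function name, the `max_parts` limit, the idiom set, and the sample token list are hypothetical; in practice the tokens might come from something like `jieba.lcut(text)`, but they are hard-coded here so the sketch runs without Jieba installed.

```python
def merge_chengyu(tokens, chengyu, max_parts=4):
    """Rejoin adjacent tokens whose concatenation is a known idiom."""
    merged = []
    i = 0
    while i < len(tokens):
        # Prefer the longest run of adjacent tokens that forms an idiom.
        for n in range(min(max_parts, len(tokens) - i), 1, -1):
            joined = "".join(tokens[i:i + n])
            if joined in chengyu:
                merged.append(joined)
                i += n
                break
        else:
            # No run starting here forms an idiom; keep the token as-is.
            merged.append(tokens[i])
            i += 1
    return merged


CHENGYU = {"一步登天"}  # hypothetical idiom list for the example

# Hypothetical segmenter output in which the idiom was split in two.
raw_tokens = ["他", "想", "一步", "登天"]
result = merge_chengyu(raw_tokens, CHENGYU)
```

Here `result` keeps the single-character words untouched and rejoins the two halves into one four-character entry.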

[0] https://github.com/fxsjy/jieba


Great project! It's fascinating how hard segmentation is and how many approaches there are. I thought I'd mention a trick that can let you segment without a backend. When you double click Chinese text in the browser, it will highlight an entire word. For example, try double clicking on the text here: 一步登天:走一步就到天堂美好境地。 It highlights/segments the first 4 characters as a chengyu, and the others as one or two character words. I haven't been able to discover what method Apple and Microsoft use to segment, but it seems to do a good job. You can even use JavaScript's Range.expand() function to do this programmatically. I once even made a little JS library that can run in the background and segment words on a page.


Last I checked, browsers basically wrap ICU's word-break iterator: https://unicode-org.github.io/icu/userguide/boundaryanalysis...


That’s neat!



