Thank you for posting this code on GitHub! There has been some reverse-engineeri...

krackers · on Sept 13, 2021

>Apple's word segmentation

Unless they changed it, it's probably similar to CFStringTokenizer, which used ICU Boundary Analysis (and maybe MeCab for Japanese).

peterburkimsher · on Sept 13, 2021

Thank you! The ICU Boundary Analysis documentation says it uses a dictionary to split Chinese, Japanese, Thai or Khmer.

Is that the same as the macOS dictionary being parsed here? It seems like a pretty big file to grep every time!
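
To make "uses a dictionary to split" concrete, here is a minimal sketch of dictionary-based segmentation using greedy forward longest match. This is a simplified stand-in, not ICU's actual algorithm, and the word list, the `segment` function, and the `max_word_len` parameter are all made up for illustration.

```python
# Hypothetical toy word list; a real segmenter would load a far larger one.
TOY_DICTIONARY = {"我", "喜欢", "北京", "北京大学", "大学"}


def segment(text, dictionary, max_word_len=4):
    """Split text into tokens with greedy forward longest match.

    Assumption: unknown characters fall back to single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


print(segment("我喜欢北京大学", TOY_DICTIONARY))
```

The set lookup here is already much cheaper than grepping a file for every word, but it still slices up to `max_word_len` substrings at each position.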

krackers · on Sept 13, 2021

I assume it's converted at compile time to a more efficient query format.
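
As a rough sketch of what a "more efficient query format" could look like, a word list can be turned into a trie once, up front, so that each lookup only walks the characters of the input instead of scanning the whole list. I'm not claiming this is the format ICU or macOS actually uses; `build_trie`, `longest_match`, and the `END` sentinel are hypothetical names.

```python
END = ""  # sentinel key marking "a word ends here"; never equals a single character


def build_trie(words):
    """Build a nested-dict trie once, ahead of query time."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_match(trie, text, start):
    """Return the length of the longest dictionary word starting at text[start], or 0."""
    node = trie
    best = 0
    for offset in range(start, len(text)):
        node = node.get(text[offset])
        if node is None:
            break  # no dictionary word continues with this character
        if END in node:
            best = offset - start + 1
    return best


trie = build_trie(["北京", "北京大学", "大学"])
print(longest_match(trie, "北京大学生", 0))
```

Each query touches at most as many nodes as the longest word is long, regardless of how big the dictionary is.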