
Great questions — and you’re absolutely right that Latin vs. CJK is effectively two different universes in terms of reconstruction.

1. Latin vs. CJK differences

Latin glyphs are structurally simple: limited stroke vocabulary, mostly predictable modulation, and relatively low topological variation. Once you can recover outlines and stroke junctions accurately, mapping to Unicode is almost trivial.

For Latin, that step can be handled with standard OCR methods.
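As a rough illustration of that path (pytesseract here is just one off-the-shelf option, not a statement about any particular production stack):

    # Minimal sketch of the Latin case, assuming Tesseract is installed
    # and the scan is a reasonably clean image.
    from PIL import Image
    import pytesseract

    def latin_to_text(scan_path: str) -> str:
        """Clean Latin scans map to Unicode text with stock OCR."""
        image = Image.open(scan_path)
        # "eng" is illustrative; any Latin-script traineddata behaves similarly.
        return pytesseract.image_to_string(image, lang="eng")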

CJK is the opposite. Each character is effectively a miniature blueprint with dozens of micro-decisions: stroke order, brush pressure artifacts, serif style, shape proportion, and even regional typographic conventions. Treating it like Latin “but bigger” doesn’t work. So the workflow for CJK has extra normalization steps and more constraints, especially when reconstructing consistent glyph families rather than one-offs.

Put simply: many CJK characters are made of disconnected pieces that still belong to a single character, so segmentation can't assume one connected blob per glyph.
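A toy sketch of why that matters, assuming a binarized scan and roughly fixed-pitch typesetting (the grid-bucketing heuristic is mine, purely for illustration):

    import cv2
    import numpy as np

    def group_components(binary: np.ndarray, cell: int) -> dict:
        """One connected component per glyph fails for CJK (think 三 or 心),
        so bucket components into a fixed grid and merge per cell."""
        n, _labels, _stats, centroids = cv2.connectedComponentsWithStats(binary)
        chars: dict = {}
        for i in range(1, n):  # label 0 is the background
            cx, cy = centroids[i]
            chars.setdefault((int(cx) // cell, int(cy) // cell), []).append(i)
        return chars

Real layouts need smarter merging than a fixed grid, but the point stands: the three strokes of 三 are one character, not three.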

2. How we match scans to Unicode entities

We don't rely on conventional OCR at all. OCR engines are optimized for reading text, not for recovering the underlying design intent. Our process is closer to forensic glyph analysis: reconstructing stable structural signatures, then mapping those signatures to references.

This ends up being a hybrid:

• deterministic structural matching
• limited supervised correction when ambiguity exists
• and zero reliance on any off-the-shelf OCR models

It's not "OCR first, match later." It's "reconstruct the letterpress structure, then Unicode becomes a lookup." OCR quality doesn't limit us because OCR isn't part of the critical path.
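To make the "lookup" framing concrete, a deliberately crude sketch: Hu moments stand in for whatever structural signature is actually used, and refs is a hypothetical table precomputed from trusted reference renderings of each character:

    import cv2
    import numpy as np

    def signature(glyph: np.ndarray) -> np.ndarray:
        """Crude shape signature from image moments (log-compressed)."""
        hu = cv2.HuMoments(cv2.moments(glyph)).flatten()
        return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

    def to_unicode(glyph: np.ndarray, refs: dict) -> str:
        """Nearest reference signature wins; refs maps codepoint -> signature."""
        s = signature(glyph)
        return min(refs, key=lambda ch: np.linalg.norm(s - refs[ch]))

The real signatures are presumably far richer (junctions, stroke topology), but once they're stable, matching really is just a lookup.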

3. What determines coverage

Coverage is defined by what we can physically access and reconstruct cleanly. For Latin, coverage is straightforward. For CJK, coverage is shaped by:

• typeface completeness in the source material
• the consistency of impression depth
• survivability of fine strokes in early printings
• and the practical question of how many thousand characters the original font designer actually cut

There's no need for the entire Unicode set per book: the historical font only ever covered a finite subset. No single book uses every glyph that font contained, which is unfortunate but not catastrophic, because we can source many public-domain books from the same era and eventually accumulate enough characters in a matching style.
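A sketch of that accumulation, where extract_glyphs is a hypothetical function yielding the characters reconstructed cleanly from one scanned book:

    from typing import Callable, Iterable

    # No single book exercises the full historical character set, so we
    # union the glyphs recovered from many same-era books.
    def coverage(books: list[str], target: set[str],
                 extract_glyphs: Callable[[str], Iterable[str]]) -> float:
        recovered: set[str] = set()
        for book in books:
            recovered |= set(extract_glyphs(book)) & target
        return len(recovered) / len(target)  # fraction of the historical set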

In short: Latin is an engineering challenge. CJK is an archaeological one. OCR is not a bottleneck because we don’t use it. Coverage follows the historical material, not Unicode completeness.



Would love to hear or read more about this if it is public.




