
You'll get 80% of the benefit just by looking at word frequency, highlighting outliers, and then applying a weight based on factors such as document length plus your own secret-sauce weighting.

Bonus points if you're using multiple categorizations (using different weights for different industries).
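To make that concrete, here is a rough sketch of the idea in Python, not anyone's production method. It computes per-document word rates, flags terms whose z-score across the batch is unusually high, and sums them with per-industry weights. The `INDUSTRY_WEIGHTS` table, the `z_cutoff` and `full_length` parameters, and the length damping are all made-up placeholders standing in for the "secret sauce".

```python
import re
from collections import Counter
from statistics import mean, pstdev

# Hypothetical per-industry term weights; real values would need tuning.
INDUSTRY_WEIGHTS = {
    "legal": {"research": 2.0, "conference": 1.5},
    "medical": {"consultation": 2.0},
}


def term_rates(text):
    """Return ({term: share of tokens}, token count) for one document."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {}, 0
    counts = Counter(tokens)
    return {term: n / total for term, n in counts.items()}, total


def score_documents(docs, industry, z_cutoff=2.0, full_length=20):
    """Score each document by its weighted word-frequency outliers."""
    weights = INDUSTRY_WEIGHTS.get(industry, {})
    rated = [term_rates(doc) for doc in docs]

    # Mean and standard deviation of each term's rate across the batch.
    vocab = set().union(*(rates for rates, _ in rated))
    stats = {}
    for term in vocab:
        values = [rates.get(term, 0.0) for rates, _ in rated]
        stats[term] = (mean(values), pstdev(values))

    scores = []
    for rates, total in rated:
        score = 0.0
        for term, rate in rates.items():
            mu, sigma = stats[term]
            if sigma == 0:
                continue  # term is used identically everywhere
            z = (rate - mu) / sigma
            if z > z_cutoff:
                score += weights.get(term, 1.0) * z
        # Assumed length factor: damp scores of very short documents.
        scores.append(score * min(1.0, total / full_length))
    return scores


docs = ["reviewed contract and drafted reply"] * 5 + [
    "research research research research conference call"
]
scores = score_documents(docs, "legal")
```

In a small batch like this, any word that appears in only one document crosses the cutoff, so expect plenty of false positives until the batch is large and the weights are tuned.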

NLP / statistical stuff is fun ;)

Are you scanning / OCRing the documents? I never managed to get OCR good enough for invoicing; there always had to be a manual process to fix the (machine-learning-flagged) errors.

Or don't you need accurate-to-the-cent invoices?



Word frequency is in use at many larger insurance companies today. You can certainly find problematic bills with word frequency and the hours billed, but you end up with a lot of false positives, so you still have to manually review everything. We go deeper than word frequency.

And, yes! NLP + statistics is fun!



