Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Not sure how it's supposed to be solved, but I thought of a cute hack in the special case where the character set of the strings is limited, e.g. lowercase ASCII only.

Let A be the alphabet used. Precompute the "letter sum" of each string and store it. Then given two strings with letter sums c1 and c2, they cannot have Levenshtein distance k if |c1-c2| > k|A|. This could allow you to skip a lot of rows.



I may be misunderstanding you, but an orderless sum of characters in a string won't be an effective prefilter for any string similarity algo.

Think "atc" and "cat". Same sum.


The sum forms a necessary but not sufficient condition for being within a certain Levenshtein distance. In your example, the inequality I gave above does not apply since |c1-c2| = 0. You would have to calculate the Levenshtein distance. In cases where the inequality is satisfied, you do not have to calculate the Levenshtein distance at all.

The idea is that any edit operation on a string will at most change the letter sum by |A|.




Consider applying for YC's Winter 2026 batch! Applications are open till Nov 10

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: