In practice with Lucene, len(string) and len(query) are on the order of 10, so that cost is basically irrelevant. Furthermore, computing the step is extremely fast: you're just doing a handful of min()'s. Even a single cache miss is going to completely dominate that, let alone a disk seek. What matters is that you don't scan a 10 million word index; instead you prune the search so it only touches, say, 50 words in the index.
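For concreteness, this is roughly the step() I'm talking about (a sketch with illustrative names, not my exact code; in my implementation the state list is called score): the state for a word W is the current row of the standard edit-distance table, and stepping on one character is a few additions and min()'s per entry.

    def start(word):
        # Initial state: the distance from each prefix of `word` to the
        # empty string.
        return list(range(len(word) + 1))

    def step(state, c, word):
        # Successor state after consuming one character `c`.
        new_state = [state[0] + 1]
        for i in range(1, len(word) + 1):
            cost = 0 if word[i - 1] == c else 1
            new_state.append(min(new_state[i - 1] + 1,  # insertion
                                 state[i] + 1,          # deletion
                                 state[i - 1] + cost))  # substitution or match
        return new_state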
That's just the step() approach. After that I described how to build the DFA, which gets you the same optimal linear time you want, independent of the query string size.
Reply to other comment:
> Your DFA construction, while a bit incomplete (you don't say how you do the transitions), achieves roughly the same thing as Levenshtein automata do. But you spend significantly more time to construct it. The point of the original paper was not to show that DFAs can be used to compute Levenshtein distance, but to show how to do it quickly and efficiently.
Why is it incomplete? You just follow the step() automaton and construct a DFA that does the same thing: every time you hit a new state you create a new node in the DFA, and if you hit an old state you point to the previously constructed node for that state. You can even do the DFA construction lazily.
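Concretely, something like this (a sketch that assumes the start()/step() code above and a small fixed alphabet; it builds the whole thing eagerly, the lazy version just creates nodes on demand):

    def build_dfa(word, k, alphabet):
        # Cap entries at k + 1: anything above k can never bring the final
        # distance back under k, so capping keeps the number of distinct
        # states finite without changing the accepted language.
        def cap(state):
            return tuple(min(x, k + 1) for x in state)

        start_key = cap(start(word))
        states = {start_key: 0}      # capped state -> DFA node id
        transitions = {0: {}}        # node id -> {character: node id}
        accepting = set()
        queue = [start_key]

        while queue:
            state = queue.pop()
            node = states[state]
            if state[-1] <= k:
                accepting.add(node)
            for c in alphabet:
                nxt = cap(step(state, c, word))
                if nxt not in states:               # new state -> new DFA node
                    states[nxt] = len(states)
                    transitions[states[nxt]] = {}
                    queue.append(nxt)
                transitions[node][c] = states[nxt]  # old state -> existing node

        return transitions, accepting

Since every letter that does not occur in the word behaves identically, a real implementation would collapse them into a single "other" transition instead of looping over the full alphabet.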
> But you didn't implement Levenshtein automata.
A Levenshtein automaton for a word W is a finite state automaton that accepts exactly the words V within edit distance n of W. That's exactly what I built. The point here is that you turn a 30 second search over the whole index into a 0.02 second search by pruning. If you can then optimize that to 0.015 seconds by making the automaton more efficient, that's nice, but you can hardly claim that what I did is not a Levenshtein automaton just because it's a bit slower (and it's not even clear to me that it would be slower).
You can argue about the lengths of these strings and the simplicity of your step function all day long; at the end of the day it's still Θ(n²) and no better than the simple dynamic programming approach. In fact, even for that it's a relatively bad implementation because it constructs a new list in every step, whereas you could just reuse the same two lists (called score in your implementation) all the time.
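For reference, this is the kind of two-list reuse I mean (a rough sketch with illustrative names, not your actual code):

    def edit_distance(query, word):
        # Keep two rows and swap them instead of allocating a new list
        # on every step.
        prev = list(range(len(word) + 1))
        curr = [0] * (len(word) + 1)
        for i, qc in enumerate(query, start=1):
            curr[0] = i
            for j in range(1, len(word) + 1):
                cost = 0 if word[j - 1] == qc else 1
                curr[j] = min(curr[j - 1] + 1,     # insertion
                              prev[j] + 1,         # deletion
                              prev[j - 1] + cost)  # substitution or match
            prev, curr = curr, prev
        return prev[-1]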
An L1 cache miss that hits in L2 costs around 10 cycles, and the L2 cache is more than sufficiently large. Even your initialisation loop takes more than that. The min loop compiles to 11 branchless instructions per iteration for the minimum calculations in C++: https://goo.gl/wjhRtb - and that's not taking superscalar execution into account.
You have not shown how you prune the search, so I can't say anything about that. Of course that's the entire point of having an index.
Your DFA construction again is massively slower than what is done in the paper. The authors show how to construct the entire automaton in time O(|string|), whereas each step in your implementation takes that much time.
Whether or not you built Levenshtein automata is a pointless discussion. You say you built a DFA for Levenshtein distance (true). I'm saying that you didn't implement the paper. Both are correct.
Look, I'm not claiming you did anything wrong. I'm just pointing out that your implementation, while it was fast to write, is also much much slower than their algorithm, and you shouldn't compare the two as if they were the same.
> You can argue about the lengths of these strings and the simplicity of your step function all day long; at the end of the day it's still Θ(n²) and no better than the simple dynamic programming approach.
It's not O(n^2), but I forgot to mention why: you don't need to calculate the full state, only the O(k) entries around the diagonal when your maximum edit distance is k, because all the other entries are going to be greater than k anyway. So it's actually O(nk), which is linear in n (that paper assumes the max edit distance and the alphabet size are fixed, so by their own assumption O(nk) = O(n)).

But I think you are missing what a Levenshtein automaton is. The point of a Levenshtein automaton is not to calculate the edit distance; in fact it does not need to calculate the edit distance at all. What a Levenshtein automaton with max distance k for a word like "banana" does is this: given a prefix like "caba", it determines whether there is a way to extend that prefix so that its distance to "banana" is at most k. The dynamic programming algorithm does not do this. They are two different (but of course related) problems.
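To make that concrete, here is how the state answers both questions, assuming the start()/step() sketch from above (word "banana", k = 2; illustrative names, not my actual code):

    def is_match(state, k):
        # The string consumed so far is itself within distance k of the word.
        return state[-1] <= k

    def can_match(state, k):
        # Some extension of the consumed prefix can still end up within
        # distance k: append word[i:] at a position i where state[i] is minimal.
        return min(state) <= k

    word, k = "banana", 2
    state = start(word)
    for c in "caba":
        state = step(state, c, word)
    print(is_match(state, k))   # False: "caba" itself is 4 edits from "banana"
    print(can_match(state, k))  # True: e.g. "cabana" is within 2 edits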
> In fact, even for that it's a relatively bad implementation because it constructs a new list in every step, whereas you could just reuse the same two lists (called score in your implementation) all the time.
Of course, it's a proof of concept, not a fast implementation. Plus it's not as simple as keeping two lists when you're searching a text index, because from the same state you will need to compute multiple successor states: if you destroy a state s when stepping, you can't step from that same s with another letter.
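Concretely, with the step() sketch from above (illustrative, not my actual code):

    # From the same state you may need successors for several letters,
    # so stepping has to leave the original state intact.
    s = start("banana")
    s_b = step(s, "b", "banana")  # branch for 'b'
    s_c = step(s, "c", "banana")  # branch for 'c' still needs the original s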
> An L1 cache miss that hits in L2 costs around 10 cycles, and the L2 cache is more than sufficiently large
A normal full-text index does not even fit in L3. It may not even fit in main memory.
> You have not shown how you prune the search, so I can't say anything about that. Of course that's the entire point of having an index.
Pruning the search works exactly the same way given any implementation of a Levenshtein automaton. It's just intersecting the automaton with the index.
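To spell out what "intersecting" means here, a sketch with the index modeled as a plain trie (nested dicts, with "$" marking the end of a word; a real index is structured differently, this is only to show the pruning), reusing the start()/step()/is_match()/can_match() sketches from above:

    def fuzzy_search(trie, word, k):
        # Walk the index and the automaton in lockstep, cutting off every
        # subtree whose state can no longer lead to a match.
        results = []

        def walk(node, prefix, state):
            if "$" in node and is_match(state, k):
                results.append(prefix)
            for c, child in node.items():
                if c == "$":
                    continue
                # The same state is stepped with every child letter, which is
                # why stepping must not destroy it.
                nxt = step(state, c, word)
                if can_match(nxt, k):
                    walk(child, prefix + c, nxt)

        walk(trie, "", start(word))
        return results

The pruning is the whole point: almost all of the index hangs off prefixes whose state already fails can_match(), so those subtrees are never visited, regardless of how the automaton itself was constructed.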
> Whether or not you built Levenshtein automata is a pointless discussion. You say you built a DFA for Levenshtein distance (true). I'm saying that you didn't implement the paper. Both are correct.
I never claimed to implement the algorithm in the paper. And whether or not it's a Levenshtein automaton is not a pointless discussion: Lucene improved their fuzzy text search times by 100x by pruning the search, and you can do that with my method too. That's why the goal here is "implement a Levenshtein automaton", not "implement a Levenshtein automaton with the algorithm described in paper XYZ".
> I'm just pointing out that your implementation, while it was fast to write, is also much much slower than their algorithm, and you shouldn't compare the two as if they were the same.
Even if my algorithm for constructing the DFA were 10x slower than theirs (which I HIGHLY doubt), it still wouldn't matter: you can still use it to get their ~100x speedup, because it does exactly the same pruning and the cost of running the automaton is just noise.