Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Just a somewhat tangent thought: do you know of any tricks to speed up the construction phase of Aho Corasick? A recent problem of mine needed to use Aho Corasick but with several million strings in the dictionary it took a bit longer than I thought to construct it. Any tricks you know of to speed up construction? Don't particularly care about memory.


Hmmm. The phrasing of your question is a little ambiguous. I could interpret it two ways:

1. "Do you know if there are any config knobs in the aho-corasick crate that will speed up construction?"

2. "I have my own implementation of Aho-Corasick and construction isn't as fast as I hoped. What kinds of tricks can I employ to make my own implementation faster?"

I'd be happy to try and answer either of these, but I think it would be better to ask with a concrete example on the issue tracker Discussions: https://github.com/BurntSushi/aho-corasick/discussions

All of the tricks I know about (and I judged to be worth doing) are in the aho-corasick crate. There's really just no getting around the fact that you need to first build a trie (which requires visiting every byte in every pattern) and then do a breadth-first search across the entire trie to setup failure transitions. Making construction substantially faster probably requires some kind of innovation in those two parts at a fundamental level.

You can also cheat at this. That is, by designing your Aho-Corasick automaton to be zero-copy deserializable. The DFAs in regex-automata support this. I waver on whether to do this for aho-corasick, because it acts like a straight-jacket for evolving the internals of the library.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: