I am not yet 100% certain, so I've requested full-text access to read the paper, but I can say that a naive approach to string substitution will certainly be less performant than this, and will offer less flexibility in replacement complexity without significant effort.
The reason is that with a naive approach, applying a vocabulary of size M to a document of N tokens is an O(M*N) operation in the best case, whereas this is claimed to be an O(2N) operation, i.e. linear in the text.
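To make the difference concrete, here is a minimal sketch (illustrative names only, not this project's actual API): the naive version rescans the document once per vocabulary entry, while an automaton-style rewriter walks the text a constant number of times.

    # Naive approach: one pass over the document per vocabulary entry,
    # so roughly O(M * N) work for M terms and N tokens.
    def naive_rewrite(tokens, vocab):
        out = list(tokens)
        for term, replacement in vocab.items():   # M iterations
            for i, tok in enumerate(out):         # N tokens scanned each time
                if tok == term:
                    out[i] = replacement
        return out

    # An automaton-based rewriter instead makes a single pass over the text,
    # following precomputed transitions, so the scan cost is independent of M.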
I've wanted this specifically for terms-and-conditions documents for a long while... any thoughts on its applicability there? So many legal document sections are the same, but you can't just copy them.
What would it do that some basic templating library wouldn't do?
Not knowing the specifics, I would lean toward a templating library, as it would be more precise and could account for things like changing gender, plurality, a vs. an, etc., in a structured way.
Interesting project. Question on this: "For any text t of length |t| the time it takes to perform a rewrite is O(|t|+|t'|) where t' denotes the resulting output string"
Wouldn't the vocabulary size figure into the complexity? Vocabularies that would be considered useful in this context tend to be quite large. Are you achieving worst-case O(log n) search in the vocab?
Hi, thanks! The outputs are encoded in the transitions and the states themselves, so the vocabulary size doesn't affect the search. It does, however, affect construction, because for each node in the trie we have to add an outgoing transition for each distinct symbol of the input vocabulary.
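If I understand that correctly, something like the following sketch captures the idea (hypothetical names, not your actual implementation): each state carries its outgoing transitions and an optional output, so rewriting follows one transition per input symbol regardless of vocabulary size, while construction has to touch a transition for every (node, symbol) pair it adds.

    # Rough sketch of the idea as I understand it; State and build_trie are
    # illustrative names, not the project's actual data structures.
    class State:
        def __init__(self):
            self.transitions = {}   # symbol -> next State
            self.output = None      # replacement emitted when a term ends here

    def build_trie(vocab):
        # vocab: dict mapping input terms to their replacements.
        # Construction cost grows with the number of (node, symbol) pairs,
        # but a lookup only ever follows one transition per input symbol.
        root = State()
        for term, replacement in vocab.items():
            node = root
            for symbol in term:
                node = node.transitions.setdefault(symbol, State())
            node.output = replacement
        return root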