I am not yet 100% certain, so I've requested full-text access to read the paper, but I can say that a naive approach to string substitution will certainly be less performant than this, and will offer less flexibility in replacement complexity without significant effort.
The reason is that with a naive approach, applying a vocabulary of size M to a document of N tokens is an O(M*N) operation in the best case, whereas this is claimed to be an O(2N) operation, i.e. linear in the text.
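To make the difference concrete, here is a minimal sketch (illustrative names only, not this project's actual API): the naive version rescans the document once per vocabulary entry, while an automaton-style rewriter walks the text a constant number of times.

    # Naive approach: one pass over the document per vocabulary entry,
    # so roughly O(M * N) work for M terms and N tokens.
    def naive_rewrite(tokens, vocab):
        out = list(tokens)
        for term, replacement in vocab.items():   # M iterations
            for i, tok in enumerate(out):         # N tokens scanned each time
                if tok == term:
                    out[i] = replacement
        return out

    # An automaton-based rewriter instead makes a single pass over the text,
    # following precomputed transitions, so the scan cost is independent of M.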
I've wanted this specifically for terms-and-conditions documents for a long while... any thoughts on its applicability there? So many legal document sections are the same, but you can't just copy them.
What would it do that some basic templating library wouldn't do?
Not knowing the specifics, I would lean toward a templating library, as it would be more precise and could account for things like changing gender, plurality, a vs. an, etc., in a structured way.
Interesting project. Question on this: "For any text t of length |t| the time it takes to perform a rewrite is O(|t|+|t'|) where t' denotes the resulting output string"
Wouldn't the vocabulary size figure into the complexity? Vocabularies that would be considered useful in this context tend to be quite large. Are you achieving worst-case O(log n) search in the vocab?
Hi, thanks! The outputs are encoded in the transitions and the states themselves, so the vocabulary size doesn't affect the search. It does, however, affect construction, because for each node in the trie we have to add an outgoing transition for each distinct symbol of the input vocabulary.
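If I understand that correctly, something like the following sketch captures the idea (hypothetical names, not your actual implementation): each state carries its outgoing transitions and an optional output, so rewriting follows one transition per input symbol regardless of vocabulary size, while construction has to touch a transition for every (node, symbol) pair it adds.

    # Rough sketch of the idea as I understand it; State and build_trie are
    # illustrative names, not the project's actual data structures.
    class State:
        def __init__(self):
            self.transitions = {}   # symbol -> next State
            self.output = None      # replacement emitted when a term ends here

    def build_trie(vocab):
        # vocab: dict mapping input terms to their replacements.
        # Construction cost grows with the number of (node, symbol) pairs,
        # but a lookup only ever follows one transition per input symbol.
        root = State()
        for term, replacement in vocab.items():
            node = root
            for symbol in term:
                node = node.transitions.setdefault(symbol, State())
            node.output = replacement
        return root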