(What the article calls) Canonisation is an interesting issue - it seems very hard to get right and to do so efficiently.
I'm trying to wrap my head around the issue in a regex parser that I've knocked up.
I'm currently ending up (expectedly) with multiple nodes representing equivalent languages; I want to strip these out when I use code generation to convert the constructed network into a switch-based static automaton.
The most robust way to see whether two languages are the same is to XOR them and test the result for emptiness - but this means comparing every pair of nodes throughout the network, and I'd rather avoid the O(n^2) scaling if there's another option.
That option is to generate a canonical expression for the language that the machine represents - somewhat difficult, but far more efficient when it comes to detecting collisions.
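For concreteness, here's the pairwise check I'm trying to avoid, as a minimal sketch assuming complete DFAs given as (start, accepting set, transition table) triples (all the names here are hypothetical, not from my actual code):

```python
from collections import deque

def equivalent(dfa_a, dfa_b, alphabet):
    """Test L(A) == L(B) by walking the product machine and checking
    the symmetric difference (XOR) language for emptiness.
    Each DFA is a (start, accepting_set, delta) triple, with
    delta[state][char] -> state and every transition defined."""
    start_a, accept_a, delta_a = dfa_a
    start_b, accept_b, delta_b = dfa_b
    seen = {(start_a, start_b)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        if (p in accept_a) != (q in accept_b):
            return False  # a word in the XOR language: not equivalent
        for c in alphabet:
            nxt = (delta_a[p][c], delta_b[q][c])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True  # XOR language is empty: the languages coincide
```

Running that over every pair of nodes is exactly the n^2 I want to dodge; with a canonical form I can bucket nodes by a hash of the expression instead.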
Canonizing is a fascinating research topic (which I’ve done quite a bit of work on).
The shortish version is: if you have an unordered list, you can just sort the list.
With graphs, you have to do a backtrack search — there are many ways to speed that search up, and in practice it is very, very quick. This algorithm is, I think, the coolest algorithm most people have never heard of. There is a fairly good description in the paper “Practical Graph Isomorphism II”.
For most structures more complicated than graphs, it turns out you can often turn them into graphs, then use the existing graph software.
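To make the backtracking idea concrete: the brute-force version of graph canonisation is just "take the smallest relabelling". A sketch (this is not nauty; the real tools get the same answer while pruning the n! search very aggressively):

```python
from itertools import permutations

def canonical_form(n, edges):
    """Smallest edge set over all relabellings of the n vertices.
    Two graphs are isomorphic iff their canonical forms are equal.
    (The unordered-list case from above is just tuple(sorted(xs)).)"""
    best = None
    for perm in permutations(range(n)):
        relabelled = frozenset(
            (min(perm[u], perm[v]), max(perm[u], perm[v])) for u, v in edges
        )
        key = tuple(sorted(relabelled))
        if best is None or key < best:
            best = key
    return best
```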
Unfortunately, I don't think any of this applies to your case: the trouble is that the same language can be represented in quite different ways, so this stuff won't be much help :)
The current plan is not to perform any part of the canonisation on a graph, but to calculate a language table for the network and canonise the resultant expressions (which are trees, not graphs).
I've mostly figured out the sorting aspect. The problematic part is language de-aliasing, which means I'm going to have to refactor some things.
The language extraction (it's just iterative expression expansion and a language-mapping function; example below) shouldn't be too hard to write down; I've already implemented it in my head, and I don't think there are any corner cases that I'm missing.
The language de-aliasing is going to change the way that certain expressions get translated into automata.
For example, right now there are generator methods for things like & and ^, but in order to de-alias the languages it's better if the expression gets expanded first, and the expanded form then gets converted into a machine.
So ^(A,B) ≡ :(&(A,!B),&(!A,B)) ≡ :(!:(!A,B),!:(A,!B)).
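Roughly the pass I have in mind, sketched over tuples like ('^', A, B) as stand-ins for whatever node types the parser really uses:

```python
def expand(expr):
    """Rewrite ^ and & into : (OR) and ! (NOT) so every language
    reaches the generator in the same de-aliased vocabulary."""
    if not isinstance(expr, tuple):
        return expr  # leaf: a character or range
    op, *args = expr
    args = [expand(a) for a in args]
    if op == '&':
        a, b = args
        return ('!', (':', ('!', a), ('!', b)))  # De Morgan: a & b
    if op == '^':
        a, b = args
        return (':', ('!', (':', ('!', a), b)),   # A & !B
                     ('!', (':', a, ('!', b))))   # !A & B
    return (op, *args)
```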
Even a simple expression such as !a needs to be de-aliased.
!/a ≡ :([.../.],[/b...])
(In my insane non-standard syntax, NOT 'a' is the OR of all chars up to and including '.' (the char before 'a') and all chars 'b' and above.)
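The character case sketches out the same way, assuming for illustration a contiguous ASCII-style alphabet with hypothetical endpoints LO and HI (my real alphabet evidently orders things differently):

```python
LO, HI = ' ', '~'  # assumed printable-ASCII alphabet; adjust to taste

def dealias_not_char(c):
    """Rewrite NOT c as the OR of the range below c and the range
    above c, mirroring !/a == :([.../.],[/b...]) above."""
    parts = []
    if c > LO:
        parts.append(('range', LO, chr(ord(c) - 1)))
    if c < HI:
        parts.append(('range', chr(ord(c) + 1), HI))
    return (':', *parts) if len(parts) > 1 else parts[0]
```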
So: just run through iterating away the state symbols until all that's left is pure expressions, attempt to de-alias the languages, then do a recursive sort on the expression trees.
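The sort itself would look something like this (same hypothetical tuple nodes): order the operands of commutative operators so that equivalent trees come out byte-for-byte identical, and collisions fall out of a hash lookup.

```python
COMMUTATIVE = {':'}  # & and ^ are gone after expansion;
                     # concatenation, if present, must keep its order

def canonical_sort(expr):
    """Recursively sort commutative operands so equivalent
    expression trees compare (and hash) as equal."""
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [canonical_sort(a) for a in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)  # any total order works, as long as it's fixed
    return (op, *args)
```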