This uses dense (soft/weighted) transitions from any state to any state, and then some regularization to guide it to more sparse solutions.
In practice, the number of states can be huge (thousands, maybe millions), so representing this as a dense matrix is not going to work (a 1M x 1M matrix is way too big). The transition structure must be sparse, and for just about any FST you actually deal with in practice, it is. So it's very much a waste to represent it as a dense matrix.
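To put rough numbers on that, here is a back-of-the-envelope sketch (mine, not tied to any particular library; the 10-arcs-per-state figure is just an assumption for illustration):

```python
import numpy as np

n_states = 1_000_000

# Dense: one n_states x n_states float32 matrix *per input symbol*.
dense_bytes = n_states * n_states * 4
print(f"dense: {dense_bytes / 1e12:.0f} TB per symbol")        # ~4 TB

# Sparse: real FSTs have only a handful of outgoing arcs per state,
# so a COO-style arc list (src, dst, label, weight) stays small.
arcs_per_state = 10                                            # assumption
sparse_bytes = n_states * arcs_per_state * 16                  # ~16 bytes/arc
print(f"sparse: {sparse_bytes / 1e9:.2f} GB")                  # ~0.16 GB

# A minimal arc-list representation for a tiny toy FST:
src    = np.array([0, 1, 1], dtype=np.int32)
dst    = np.array([1, 2, 0], dtype=np.int32)
label  = np.array([0, 1, 0], dtype=np.int32)
weight = np.array([0.5, 1.0, 0.5], dtype=np.float32)
```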
That's why there are many specialized libraries to deal with FSTs. Also in combination with deep learning tools. See e.g. K2 (https://github.com/k2-fsa/k2).
Although this blog post focuses on learning small toy examples, I could see the approach scaling up to real problems in natural language processing.
At one point, Google's own speech recognition system was built on a C++ finite state transducer library [1][2]. FSTs were used to represent the language model, the reverse dictionary (mapping phonemes to words), the hidden Markov models, and so on. Although FSTs were a common abstraction at runtime, at training time we still needed special-purpose ML algorithms to create models that were later "compiled" into FSTs. [3]
So I could see how a general-purpose differentiable FSM, if scaled up, could actually produce a really powerful simplification -- you would just need one end-to-end learning algorithm to subsume all these various models.
Someone else has already posted this above, but the K2 project [0] is intended to do exactly this (along with some nice integration into PyTorch).
There's a big cross-over with the Kaldi team, so this seems (partly) motivated by a desire to improve/expand Kaldi to combine WFSTs with auto-grad learning (though K2 as a whole is a generic framework, and nothing is specific to voice).
It's still in its early days and I haven't delved into the details yet, but I suspect this will be the go-to method for speech recognition in a few years' time. Given that a few other teams are working on similar toolkits (see elsewhere in this thread), it seems a lot of people see promise.
For anyone coming in a bit confused by this, the magic here is in the concept of "differentiable". I'm fairly frequently touting the benefits of thinking in terms of differentiable programming and the most common response is "what does that even mean?"
The core idea is that if a program is "differentiable", you can take the derivative of that program the same way you can take the derivative of a simple function in calculus.
Why does that matter? Well, being able to take the derivative of a problem in turn means you can optimize that problem with gradient descent. Chris Olah recently had a great discussion on Twitter [0] about what's great about gradient descent, in short:
> Simple gradient descent creates mind-boggling structure and behavior, just as evolution creates the awe inspiring complexity of nature.
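To make "taking the derivative of a program" concrete, here's a minimal sketch using PyTorch's autograd (any autodiff framework would do; the function and numbers are made up purely for illustration):

```python
import torch

def tiny_program(x):
    # An ordinary Python function built from differentiable operations;
    # loops and arithmetic are fine as long as every step is differentiable.
    y = x
    for _ in range(3):
        y = torch.sin(y) + 0.5 * y ** 2
    return y

x = torch.tensor(1.0, requires_grad=True)
loss = tiny_program(x)
loss.backward()        # "take the derivative of the program"
print(x.grad)          # d loss / d x

# ...which is exactly what a gradient-descent step needs:
with torch.no_grad():
    x -= 0.1 * x.grad
```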
Differentiable programming is, in a strange way, similar to logic programming in the sense that you can start solving problems by defining what the solution looks like. In this example you define the input and desired output and then learn the FSM that maps one to the other. Recall from Theory of Computation classes that an FSM is the simplest, least powerful model of computation (FSMs recognize exactly the regular languages).
What's amazing here is that you're essentially seeing gradient descent writing a program. IMHO, that is insanely cool.
What makes differentiable programming so interesting compared to something like logic programming is that you have a potentially much larger space of problems that can be solved in this manner, and, compared to the backtracking algorithms used by logic programming languages, gradient descent allows for potentially faster (though also approximate) solutions at scale.
Currently the vast majority of differentiable programming examples are essentially just building neural nets, but examples like this one show that we can think much more broadly about problems in terms of differentiable programming as a proper programming paradigm.
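For a flavour of what that looks like in practice, here's a rough sketch (mine, not the article's code) of the core recipe in PyTorch: parameterize per-symbol soft transition matrices, push a soft state distribution through them, and let gradient descent find an FSM that maps inputs to the desired outputs. The parity task below is just an arbitrary toy choice:

```python
import torch

n_states, n_symbols = 4, 2

# Learnable parameters: per-symbol transition logits, per-state accept logits.
T = torch.randn(n_symbols, n_states, n_states, requires_grad=True)
readout = torch.randn(n_states, requires_grad=True)

def run_fsm(bits):
    """Push a soft state distribution through the FSM, return P(accept)."""
    state = torch.zeros(n_states)
    state[0] = 1.0                                    # start in state 0
    for b in bits:
        P = torch.softmax(T[b], dim=-1)               # row-stochastic transitions for symbol b
        state = state @ P                             # soft transition
    return torch.sigmoid(state @ readout)

opt = torch.optim.Adam([T, readout], lr=0.1)
for step in range(2000):
    bits = torch.randint(0, 2, (8,)).tolist()         # a random length-8 bit string
    target = torch.tensor(float(sum(bits) % 2))       # toy task: parity of the string
    loss = torch.nn.functional.binary_cross_entropy(run_fsm(bits), target)
    opt.zero_grad(); loss.backward(); opt.step()
```

In practice you'd also add a sparsity/entropy penalty (as the article does) so the soft machine collapses toward a near-deterministic one.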
Seems like in this differentiable FSM implementation you only differentiate edge weights; you cannot introduce or change nodes. So it is quite far from differentiation of programs.
But if you add a huge number of nodes and put regularization on the sparsity of the edge weights, you have essentially a model which can adapt to problems using subsets of its structure. Somewhat like the "lottery ticket" hypothesis in NN theory. Then you can think of each traversal choice as a conditional branch and voila, the program runs!
You can see this in effect actually in the article when the author uses the penalty on the entropy of the transition probabilities.
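Concretely, that kind of penalty is only a few lines; a sketch (my formulation, not necessarily the author's exact one):

```python
import torch

def entropy_penalty(T_logits, eps=1e-9):
    """Mean entropy of the transition distributions. Adding this (times a small
    coefficient) to the loss pushes each row toward a near one-hot, i.e. a
    deterministic, sparse transition."""
    P = torch.softmax(T_logits, dim=-1)
    return -(P * torch.log(P + eps)).sum(dim=-1).mean()

# loss = task_loss + lam * entropy_penalty(T)   # lam is a hyperparameter
```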
Something that just occurred to me: finite state machines don't have to be a single "flat" graph. There are hierarchical/nested variations. For example, imagine a factory with many identical machines, where each machine has the same FSM for its operation, and then there's a FSM for the factory as a whole, modelling the interactions of the machines.
I wonder, I wonder, I wonder... could there be a differentiable and deeply nested FSM hierarchy model that -- like deep learning for neural nets -- can solve qualitatively different categories of problems compared to a "plain" FSM? As in, just as deep learning revolutionised ML, I wonder if there is an equivalent "deep FSM learning".
It smells like there's something fundamental there: Take "any" simple, differentiable systems like a layer of neural nets, an FSM, or whatever... and then make it "deep". Similarly, I wonder if all of the other concepts from NNs could have equivalents for similar differentiable systems, such as convolution, etc...
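Purely speculative, but here's one guess at what a two-level soft FSM could look like, in the spirit of the factory example above (nothing here is from the article; the mixture-over-modes construction is just one way to keep everything differentiable):

```python
import torch

n_modes, n_states, n_symbols = 3, 4, 2

# Top-level ("factory") FSM over modes, and one bottom-level ("machine")
# FSM per mode -- all soft, so the whole thing stays differentiable.
factory_T = torch.randn(n_symbols, n_modes,  n_modes,  requires_grad=True)
machine_T = torch.randn(n_modes,  n_symbols, n_states, n_states, requires_grad=True)

def run_nested(symbols):
    mode  = torch.zeros(n_modes);  mode[0]  = 1.0     # factory starts in mode 0
    state = torch.zeros(n_states); state[0] = 1.0     # machine starts in state 0
    for s in symbols:
        # The machine's effective transition is a mixture of per-mode
        # transitions, weighted by the factory's current (soft) mode.
        P = torch.einsum('m,mij->ij', mode, torch.softmax(machine_T[:, s], dim=-1))
        state = state @ P
        mode  = mode @ torch.softmax(factory_T[s], dim=-1)
    return state, mode
```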
The standard FST operation for combining knowledge sources is composition. You can compose graphs at different levels and differentiate through them too; Awni's paper talks about that: https://openreview.net/pdf?id=MpStQoD73Mj. The real problem is that you need to differentiate the graph structure, not just the weights, and the structure is inherently discrete rather than continuous.
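For the special case of weighted acceptors in the dense-matrix picture, composition (intersection) is particularly simple -- a per-symbol Kronecker product -- and it's differentiable in the weights, though, as you say, not in the discrete structure. A toy sketch of that special case (not the general transducer composition that libraries like OpenFst implement):

```python
import numpy as np

# Two small weighted acceptors over the same alphabet, in the dense view:
# A[a][i, j] = weight of going from state i to state j on symbol a.
n1, n2, n_symbols = 2, 3, 2
A1 = np.random.rand(n_symbols, n1, n1)
A2 = np.random.rand(n_symbols, n2, n2)

# The composed machine lives on the product state space; each arc weight is
# the product of the two component weights -- i.e. a Kronecker product per symbol.
composed = np.stack([np.kron(A1[a], A2[a]) for a in range(n_symbols)])
assert composed.shape == (n_symbols, n1 * n2, n1 * n2)
```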
> For anyone coming in a bit confused by this, the magic here is in the concept of "differentiable". I'm fairly frequently touting the benefits of thinking in terms of differentiable programming and the most common response is "what does that even mean?"
One way of looking at it is "differentiable is a specific mathematical property from calculus, usually of functions, that says you can perform a specific operation on them (taking a derivative)." People in ML care about differentiability because many of our most important tools, such as gradient descent, require it as a characteristic.
Another way of looking at it is "this is not a useful product and with enough vocabulary people may not notice that"
The intersection of graphs, FSMs, and linear algebra is really wild. I touched a very tiny bit on this topic a few weeks ago [0]. I highly recommend the book "Mathematics of Big Data" [1] if you're interested in this subject.
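A tiny illustration of that correspondence (mine, not from the book): a graph's adjacency matrix under ordinary (+, *) arithmetic counts walks, and swapping in other semirings (min/+, or/and, ...) gives shortest paths, reachability, and so on.

```python
import numpy as np

# Adjacency matrix of a small directed 3-cycle (also a bare-bones FSM
# if you attach labels to the edges).
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])

# (A^k)[i, j] counts the walks of length k from node i to node j.
print(np.linalg.matrix_power(A, 4))
```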
If you're interested in these kinds of things, many years ago we created TerpreT (https://arxiv.org/pdf/1608.04428.pdf and https://github.com/51alg/TerpreT) to look into generic program synthesis problems, using a set of very different techniques (gradient descent, ILP, SMT) on different problem settings (turing machines, boolean circuits, LLVM IR-style basic blocks, and straight assembly).
I've done a triple-take on this also. For some reason I thought of differentiation in the state space. Which is, of course, oxymoronic. Fuzzy-logic-like to the n-th degree???