This uses dense (soft/weighted) transitions from any state to any state, and then some regularization to guide it to more sparse solutions.
In practice, the number of states can be huge (thousands, maybe millions), so representing this as a dense matrix is not going to work (a 1M x 1M matrix is way too big). The transition structure must be sparse, and for just about any FST you actually deal with in practice, it is. So it's very much a waste to represent it as a dense matrix.
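To put rough numbers on that, here is a back-of-the-envelope sketch (mine, not tied to any particular library; the 10-arcs-per-state figure is just an assumption for illustration):

```python
import numpy as np

n_states = 1_000_000

# Dense: one n_states x n_states float32 matrix *per input symbol*.
dense_bytes = n_states * n_states * 4
print(f"dense: {dense_bytes / 1e12:.0f} TB per symbol")        # ~4 TB

# Sparse: real FSTs have only a handful of outgoing arcs per state,
# so a COO-style arc list (src, dst, label, weight) stays small.
arcs_per_state = 10                                            # assumption
sparse_bytes = n_states * arcs_per_state * 16                  # ~16 bytes/arc
print(f"sparse: {sparse_bytes / 1e9:.2f} GB")                  # ~0.16 GB

# A minimal arc-list representation for a tiny toy FST:
src    = np.array([0, 1, 1], dtype=np.int32)
dst    = np.array([1, 2, 0], dtype=np.int32)
label  = np.array([0, 1, 0], dtype=np.int32)
weight = np.array([0.5, 1.0, 0.5], dtype=np.float32)
```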
That's why there are many specialized libraries to deal with FSTs. Also in combination with deep learning tools. See e.g. K2 (https://github.com/k2-fsa/k2).
Although this blog post focuses on learning small toy examples, I could see the approach scaling up to real problems in natural language processing.
At one point, Google's own speech recognition system was built on a C++ finite state transducer library [1][2]. FSTs were used to represent the language model, the reverse dictionary (mapping phonemes to words), the hidden Markov models, and so on. Although FSTs were a common abstraction at runtime, at training time we still needed special-purpose ML algorithms to create models that were later "compiled" into FSTs. [3]
So I could see how a general-purpose differentiable FSM, if scaled up, could actually produce a really powerful simplification -- you would just need one end-to-end learning algorithm to subsume all these various models.
Someone else has already posted this above, but the K2 project [0] is intended to do exactly this (along with some nice integration into PyTorch).
There's a big cross-over with the Kaldi team, so this seems (partly) motivated by a desire to improve/expand Kaldi to combine WFSTs with auto-grad learning (though K2 as a whole is a generic framework, and nothing is specific to voice).
It's still in its early days and I haven't delved into the details yet, but I suspect this will be the go-to method for speech recognition in a few years' time. Given that a few other teams are working on similar toolkits (see elsewhere in this thread), it seems a lot of people see promise.
For anyone coming in a bit confused by this, the magic here is in the concept of "differentiable". I'm fairly frequently touting the benefits of thinking in terms of differentiable programming and the most common response is "what does that even mean?"
The core idea is that if a program is "differentiable", you can take the derivative of that program the same way you can take the derivative of a simple function in calculus.
Why does that matter? Well, being able to take the derivative of a problem in turn means you can optimize that problem with gradient descent. Chris Olah recently had a great discussion on Twitter [0] about what's great about gradient descent, in short:
> Simple gradient descent creates mind-boggling structure and behavior, just as evolution creates the awe inspiring complexity of nature.
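To make "taking the derivative of a program" concrete, here's a minimal sketch using PyTorch's autograd (any autodiff framework would do; the function and numbers are made up purely for illustration):

```python
import torch

def tiny_program(x):
    # An ordinary Python function built from differentiable operations;
    # loops and arithmetic are fine as long as every step is differentiable.
    y = x
    for _ in range(3):
        y = torch.sin(y) + 0.5 * y ** 2
    return y

x = torch.tensor(1.0, requires_grad=True)
loss = tiny_program(x)
loss.backward()        # "take the derivative of the program"
print(x.grad)          # d loss / d x

# ...which is exactly what a gradient-descent step needs:
with torch.no_grad():
    x -= 0.1 * x.grad
```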
Differentiable programming is, in a strange way, similar to logic programming in the sense that you can start solving problems by defining what the solution looks like. In this example you define the input and desired output and then learn the FSM that maps one to the other. Recall from Theory of Computation classes that an FSM is the simplest, least powerful model of computation (FSMs recognize exactly the regular languages).
What's amazing here is that you're essentially seeing gradient descent writing a program. IMHO, that is insanely cool.
What makes differentiable programming so interesting compared to something like logic programming is that you have a potentially much larger space of problems that can be solved in this manner, and, compared to the backtracking algorithms used by logic programming languages, gradient descent allows for potentially faster (though also approximate) solutions at scale.
Currently the vast majority of differentiable programming examples are essentially just building neural nets, but examples like this one show that we can think much more broadly about problems in terms of differentiable programming as a proper programming paradigm.
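For a flavour of what that looks like in practice, here's a rough sketch (mine, not the article's code) of the core recipe in PyTorch: parameterize per-symbol soft transition matrices, push a soft state distribution through them, and let gradient descent find an FSM that maps inputs to the desired outputs. The parity task below is just an arbitrary toy choice:

```python
import torch

n_states, n_symbols = 4, 2

# Learnable parameters: per-symbol transition logits, per-state accept logits.
T = torch.randn(n_symbols, n_states, n_states, requires_grad=True)
readout = torch.randn(n_states, requires_grad=True)

def run_fsm(bits):
    """Push a soft state distribution through the FSM, return P(accept)."""
    state = torch.zeros(n_states)
    state[0] = 1.0                                    # start in state 0
    for b in bits:
        P = torch.softmax(T[b], dim=-1)               # row-stochastic transitions for symbol b
        state = state @ P                             # soft transition
    return torch.sigmoid(state @ readout)

opt = torch.optim.Adam([T, readout], lr=0.1)
for step in range(2000):
    bits = torch.randint(0, 2, (8,)).tolist()         # a random length-8 bit string
    target = torch.tensor(float(sum(bits) % 2))       # toy task: parity of the string
    loss = torch.nn.functional.binary_cross_entropy(run_fsm(bits), target)
    opt.zero_grad(); loss.backward(); opt.step()
```

In practice you'd also add a sparsity/entropy penalty (as the article does) so the soft machine collapses toward a near-deterministic one.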
Seems like in this differentiable FSM implementation you only differentiate edge weights; you cannot introduce or change nodes. So it is quite far from differentiation of programs.
But if you add a huge number of nodes and put regularization on the sparsity of the edge weights, you have essentially a model which can adapt to problems using subsets of its structure. Somewhat like the "lottery ticket" hypothesis in NN theory. Then you can think of each traversal choice as a conditional branch and voila, the program runs!
You can see this in effect actually in the article when the author uses the penalty on the entropy of the transition probabilities.
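Concretely, that kind of penalty is only a few lines; a sketch (my formulation, not necessarily the author's exact one):

```python
import torch

def entropy_penalty(T_logits, eps=1e-9):
    """Mean entropy of the transition distributions. Adding this (times a small
    coefficient) to the loss pushes each row toward a near one-hot, i.e. a
    deterministic, sparse transition."""
    P = torch.softmax(T_logits, dim=-1)
    return -(P * torch.log(P + eps)).sum(dim=-1).mean()

# loss = task_loss + lam * entropy_penalty(T)   # lam is a hyperparameter
```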
Something that just occurred to me: finite state machines don't have to be a single "flat" graph. There are hierarchical/nested variations. For example, imagine a factory with many identical machines, where each machine has the same FSM for its operation, and then there's a FSM for the factory as a whole, modelling the interactions of the machines.
I wonder, I wonder, I wonder... could there be a differentiable and deeply nested FSM hierarchy model that -- like deep learning for neural nets -- can solve qualitatively different categories of problems compared to a "plain" FSM? As in, just as deep learning revolutionised ML, I wonder if there is an equivalent "deep FSM learning".
It smells like there's something fundamental there: Take "any" simple, differentiable systems like a layer of neural nets, an FSM, or whatever... and then make it "deep". Similarly, I wonder if all of the other concepts from NNs could have equivalents for similar differentiable systems, such as convolution, etc...
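Purely speculative, but here's one guess at what a two-level soft FSM could look like, in the spirit of the factory example above (nothing here is from the article; the mixture-over-modes construction is just one way to keep everything differentiable):

```python
import torch

n_modes, n_states, n_symbols = 3, 4, 2

# Top-level ("factory") FSM over modes, and one bottom-level ("machine")
# FSM per mode -- all soft, so the whole thing stays differentiable.
factory_T = torch.randn(n_symbols, n_modes,  n_modes,  requires_grad=True)
machine_T = torch.randn(n_modes,  n_symbols, n_states, n_states, requires_grad=True)

def run_nested(symbols):
    mode  = torch.zeros(n_modes);  mode[0]  = 1.0     # factory starts in mode 0
    state = torch.zeros(n_states); state[0] = 1.0     # machine starts in state 0
    for s in symbols:
        # The machine's effective transition is a mixture of per-mode
        # transitions, weighted by the factory's current (soft) mode.
        P = torch.einsum('m,mij->ij', mode, torch.softmax(machine_T[:, s], dim=-1))
        state = state @ P
        mode  = mode @ torch.softmax(factory_T[s], dim=-1)
    return state, mode
```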
The standard FST operation for combining knowledge sources is composition. You can compose graphs at different levels and differentiate through them too; Awni's paper talks about that: https://openreview.net/pdf?id=MpStQoD73Mj. The real problem is that you need to differentiate the graph structure, not just the weights, and the structure is inherently discrete rather than continuous.
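For the special case of weighted acceptors in the dense-matrix picture, composition (intersection) is particularly simple -- a per-symbol Kronecker product -- and it's differentiable in the weights, though, as you say, not in the discrete structure. A toy sketch of that special case (not the general transducer composition that libraries like OpenFst implement):

```python
import numpy as np

# Two small weighted acceptors over the same alphabet, in the dense view:
# A[a][i, j] = weight of going from state i to state j on symbol a.
n1, n2, n_symbols = 2, 3, 2
A1 = np.random.rand(n_symbols, n1, n1)
A2 = np.random.rand(n_symbols, n2, n2)

# The composed machine lives on the product state space; each arc weight is
# the product of the two component weights -- i.e. a Kronecker product per symbol.
composed = np.stack([np.kron(A1[a], A2[a]) for a in range(n_symbols)])
assert composed.shape == (n_symbols, n1 * n2, n1 * n2)
```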
> For anyone coming in a bit confused by this, the magic here is in the concept of "differentiable". I'm fairly frequently touting the benefits of thinking in terms of differentiable programming and the most common response is "what does that even mean?"
One way of looking at it is "differentiable is a specific mathematical property from calculus, usually of functions, that says you can perform a specific operation on them (taking a derivative)." People in ML care about differentiability because many of our most important tools, such as gradient descent, require it as a characteristic.
Another way of looking at it is "this is not a useful product and with enough vocabulary people may not notice that"
The intersection of graphs, FSMs, and linear algebra is really wild. I touched a very tiny bit on this topic a few weeks ago [0]. I highly recommend the book "Mathematics of Big Data" [1] if you're interested in this subject.
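A tiny illustration of that correspondence (mine, not from the book): a graph's adjacency matrix under ordinary (+, *) arithmetic counts walks, and swapping in other semirings (min/+, or/and, ...) gives shortest paths, reachability, and so on.

```python
import numpy as np

# Adjacency matrix of a small directed 3-cycle (also a bare-bones FSM
# if you attach labels to the edges).
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])

# (A^k)[i, j] counts the walks of length k from node i to node j.
print(np.linalg.matrix_power(A, 4))
```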
If you're interested in these kinds of things, many years ago we created TerpreT (https://arxiv.org/pdf/1608.04428.pdf and https://github.com/51alg/TerpreT) to look into generic program synthesis problems, using a set of very different techniques (gradient descent, ILP, SMT) on different problem settings (turing machines, boolean circuits, LLVM IR-style basic blocks, and straight assembly).
I've done a triple-take on this also. For some reason I thought of differentiation in the state space. Which is, of course, oxymoronic. Fuzzy-logic-like to the n-th degree???