As the author explains, this approach to automatic differentiation (AD) via transformation of source code "supports control flow, higher-order functions and nested derivatives. The differentiated code can be further fed into a traditional compiler such as LLVM, which results in an extremely efficient derivative program. Further, it opens up the opportunity for robust traditional compiler techniques to be extended to machine learning, enabling kernel fusion or compilation for accelerators with no artificial limitations on the kinds of models that researchers can express. This combination has not previously been possible in a high-level, general-purpose programming language."
The author's package, Zygote, makes all Julia code differentiable, so any program can be optimized as an ML/AI model to learn a set of parameters given some training objective.[a]
:-)
[a] That said, you won't be able to magically overcome the limits of mathematics, in case you're wondering. See darawk's and b_tterc_p's comments below.
I’m not sure I understand. Is this saying that any code which takes an input and maps it to an output can be given a target variable and an input range, and this package will determine what input minimizes distance to the target?
E.g.
f(x) = x + 1
Go find me what input minimizes distance to 2?
Edit: and the reference to ML is given only because that’s a common use case for nonlinear optimization?
Edit 2: based on parent's edit [a] I think I've got it right above. Basically this won't allow us to optimize anything that wouldn't have been possible to optimize before. But it will be a much less painful experience. Just implement it and then define the input space + target. I do feel like there will be a grey area of thinking "is this a good program for optimization?" but it's a really cool idea nonetheless.
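For concreteness, here's roughly what that toy example looks like with Zygote's gradient function (a minimal sketch; the squared-distance loss, step size, and iteration count are arbitrary choices of mine, not anything from the paper):

    using Zygote

    f(x) = x + 1
    loss(x) = (f(x) - 2)^2               # squared distance to the target 2

    function descend(loss, x; η = 0.1, steps = 100)
        for _ in 1:steps
            g, = gradient(loss, x)       # reverse-mode derivative of loss at x
            x -= η * g                   # plain gradient-descent step
        end
        return x
    end

    descend(loss, 0.0)                   # ≈ 1.0, since f(1) = 2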
Historically ML frameworks have only supported simple combinations of matrix multiply and convolution; it's really about generalising the idea of a model to anything that can mathematically be differentiated [1]. For example, an ODE solver is differentiable and we can throw it into the forward pass of a neural net [2].
People have done this kind of thing before, but only by e.g. reimplementing a very limited physics engine in Theano. We can just reuse existing code, along with the significant domain expertise embedded in it, and make this kind of thing a ten-line script.
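To make the "reuse existing code" point concrete, here's a small sketch of differentiating plain Julia control flow with Zygote's gradient (the horner function is just an illustrative stand-in for arbitrary pre-existing code):

    using Zygote

    # Ordinary Julia with a loop; no framework-specific ops are needed
    # for Zygote to differentiate through it.
    function horner(coeffs, x)
        acc = 0.0
        for c in reverse(coeffs)
            acc = acc * x + c
        end
        return acc
    end

    # p(x) = 1 + 2x + 3x^2, so p'(2) = 2 + 6*2 = 14
    gradient(x -> horner([1.0, 2.0, 3.0], x), 2.0)   # (14.0,)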
THANK YOU for doing this work and sharing it with the world. It's awesome.
Because of your work, I've started experimenting with Julia for building experimental DL/ML models, instead of Python+PyTorch, which is the predominant stack we use at work today for iterative experimentation.
(For those here who don't know, one-more-minute is Mike Innes, the author of the paper.)
Thanks for the kind words! I'm lucky to be part of such a smart and driven community – there's a really strong shared sense of how ML should look in Julia.
I'm always happy to hear about how this stuff is being used, if people want to reach out.
Stochastic and mini-batch gradient descent work as well as they do in supervised learning because the function you're trying to minimize is a sum or other loosely coupled aggregation of losses over individual training samples, so you can update based on one sample or mini-batch at a time. That is, you're trying to do model fitting to a data set rather than minimizing the error of some arbitrary function with a ton of parameters; I'm oversimplifying a bit.
A critical feature of model fitting problems is that the data can be normalized across the training set prior to training. Without something like that, first-order methods are almost unusable as black-box methods; you have to do a ton of custom initialization and normalization work for each problem class. They don't have affine invariance, although momentum methods can help. If you don't normalize your data or your model doesn't respond well to the usual forms of random weight initialization, you'll blow up or slow to a crawl on even trivial problems like linear least-squares. It's quite shocking that SGD and variants like coordinate descent work as well as they do, but it's important not to draw undue conclusions about their general applicability without understanding which properties of supervised learning problems contribute to their success.
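To illustrate the "loosely coupled sum of per-sample losses" point, here's a toy sketch (the sizes, learning rate, and the assumption of already well-scaled data are arbitrary choices of mine):

    using Zygote

    X = randn(1000, 3)                       # features, already roughly normalized
    w_true = [1.0, -2.0, 0.5]
    y = X * w_true .+ 0.01 .* randn(1000)

    # The loss is a mean over samples, so a gradient computed on a
    # mini-batch is an unbiased estimate of the full-data gradient.
    loss(w, Xb, yb) = sum(abs2, Xb * w .- yb) / length(yb)

    function sgd(w, X, y; η = 0.1, steps = 500, batch = 32)
        n = size(X, 1)
        for _ in 1:steps
            idx = rand(1:n, batch)           # draw a mini-batch
            g, = gradient(w -> loss(w, X[idx, :], y[idx]), w)
            w = w .- η .* g                  # update from the mini-batch alone
        end
        return w
    end

    sgd(zeros(3), X, y)                      # ≈ w_true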
I'm curious if ML experts reading this would strongly disagree with this general assessment. It's not my own area, but I've done a good deal of experimentation and working through things from scratch. If my conclusions are totally off, I'd want them to be corrected.
Anyway, yes, AD is valuable even if you don't use it for optimization. It has many applications in scientific computing and engineering; it was considered one of the great inventions in scientific computing long before deep learning made it mainstream. I have my own reverse-mode AD library and I recently used it for computing the constraint Jacobians for a multi-body simulator and for deriving the discrete Euler-Lagrange equations of motion for a given Lagrangian that's written as a complex expression across several coordinate systems. The Euler-Lagrange equations are then fed to a symplectic integrator to simulate the system. In a problem like that, traditionally you'd symbolically derive and then hard-code the expressions for the derivatives, but they get very gnarly to the point where you need specialized software just to derive the symbolic expressions. AD is great for rapid prototyping and experimentation in applications like that, and once things are settled you can use code generation to lock it down for efficiency and stand-alone use without the AD library.
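As a small illustration of that workflow with Zygote's gradient (a toy harmonic-oscillator Lagrangian, chosen only for brevity, not the gnarly multi-coordinate-system one described above):

    using Zygote

    # 1-D harmonic oscillator: L(q, qdot) = ½ m qdot² − ½ k q²
    lagrangian(q, qdot; m = 1.0, k = 4.0) = 0.5m * qdot^2 - 0.5k * q^2

    # Partial derivatives with respect to position and velocity, the
    # ingredients of the Euler-Lagrange equations, at (q, qdot) = (0.3, 1.2):
    dLdq, dLdqdot = gradient(lagrangian, 0.3, 1.2)   # (-k*q, m*qdot) = (-1.2, 1.2)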
This seems almost tautologically false to me. It seems to imply that if you implement SHA-256 in Julia, you can differentiate it and solve it backwards via SGD. Is there something I'm missing here? There must be some limitations.
Well, the limitations are the usual limitations in mathematical analysis. Even if your function is differentiable, if it has a very complicated structure you will get stuck in local minima with high probability.
People are referring to limitations regarding expressiveness.
Yes, of course there are limitations. But those limitations are not imposed by the language or the tools; they're imposed by the limits of mathematics.
In hindsight, perhaps my language was too fast and loose. Thanks for pointing that out.
SHA-256 has integer and Galois field operations, which are not generally differentiable in the numeric sense, although the NSA has a strong differentiable cryptanalysis program where they presumably do magic tricks like approximating these integer functions as real-valued, so what you are saying might be possible.
All Julia code? Great! So ccall, PyCall, JavaCall, and all other language-bridge calls can be ADed? I have been hoping to get dual numbers passing through LAPACK, so I am happy to hear that I can do that now.
Hi folks, this is part of a larger effort in the Julia community to do a ground-up rethink of infrastructure for machine learning. You can find an overview of the entire effort in our recent blog post [1]. Happy to answer questions.
As I recall, after the publication of [1] on Julia+TPUs there was some talk about doing a comparison of some ML models on the CPU, GPU, and TPU. Is this something you're working on?
We have these numbers now and the results are quite encouraging (basically on par with tuned TensorFlow for the same model, while retaining significant flexibility). At the moment, we're working with Google on stability and waiting for a new public release of the TPU software stack to enable multicore support for non-TF frontends.