Differentiable Control Problems (fluxml.ai)
72 points by dunefox on Aug 27, 2020 | 27 comments


> In contrast, running the neural network takes 5μs (twenty thousand times faster) with only a small loss in accuracy. This “approximate function inversion via gradients” trick is a very general one that can not only be used with dynamical systems, but also lies behind the fast style transfer algorithm.

Very interesting! Follow up question -- how would you choose the network architecture?


Since the network only acts on a small portion of the entire system, we can constrain it in such a way that remarkably simple NNs work just fine.

`FastChain(FastDense(3,32,tanh), FastDense(32,32,tanh), FastDense(32,2))` (from [0]) would take three inputs from your basis, run them through small tanh hidden layers of 32 units, and provide you with the two trained parameters.

This [1] example uses two hidden layers; it's one of the more complex solutions I've seen so far. To move to this complexity from a simpler chain, we first make sure our solution is not stuck in a local minimum [2], then proceed to increase the parameter count if the NN fails to converge.

[0] https://diffeqflux.sciml.ai/dev/FastChain/ [1] https://github.com/ChrisRackauckas/universal_differential_eq... [2] https://diffeqflux.sciml.ai/dev/examples/local_minima/
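For concreteness, here is a minimal sketch of what that chain looks like in use (assuming the DiffEqFlux FastChain/FastDense API from [0]; `initial_params` returns the flat parameter vector that the optimiser actually trains):

    using DiffEqFlux

    # three basis inputs -> two tanh hidden layers of 32 units -> two outputs
    nn = FastChain(FastDense(3, 32, tanh),
                   FastDense(32, 32, tanh),
                   FastDense(32, 2))

    p = initial_params(nn)   # flat vector of weights/biases handed to the optimiser
    y = nn(rand(3), p)       # evaluating the chain: 3 inputs in, 2 outputs back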


Isn't this just optimization of control parameters?

What is the comparison with existing control engineering methods like PID tuning, MPC and optimal control?


I think you are right: in the case of the trebuchet, it just computes a black-box approximation of the inverse of the system. The difference is that an analytic inversion would solve the problem for all wind and target conditions, while the NN solution will only work over the value ranges covered by the data used to train it.

In the case of the inverted pendulum, again the disadvantage of using a NN is the black-box nature of the control algorithm. As a control engineer this gives me the chills, because using black-box algorithms tells you nothing about the robustness of the closed-loop system. With model-based control, at least we have strong mathematical tools to guarantee that the closed loop will be robust enough to handle variations outside the data that we used for training (our model). With black-box algorithms like NNs you have no guarantees. I would not get into a plane controlled by a NN, for sure; look what happened with the 737 MAX when software engineers thought they could solve dynamical-systems problems.


To respond to both the parent question, and this comment: indeed, this is black-box optimal control in essence.

However, this method is just one small aspect of the SciML [0] ecosystem now. The article is a little outdated in that sense.

Once you have obtained your NN control parameter, it's possible to use Sparse Identification of Nonlinear Dynamics (SINDy) on that parameter to recover the equations of motion governing it [1].

The real promise of these methods is to use the universal approximator power of NNs to get around the 'curse of dimensionality' & uncover presently unknown representations of motion within any system. Take a look at [2] for a more detailed description.

[0]: https://sciml.ai/ [1]: https://datadriven.sciml.ai/dev/sparse_identification/sindy/ [2]: https://arxiv.org/abs/2001.04385
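For anyone curious what SINDy boils down to: the core is sparse regression over a library of candidate terms. A toy sequentially-thresholded least-squares step, written from scratch here rather than with the DataDrivenDiffEq API, might look like:

    using LinearAlgebra

    # Θ: candidate basis functions evaluated on the data (columns = library terms)
    # dX: measured/estimated derivatives; returns a sparse coefficient matrix Ξ
    function stlsq(Θ, dX; λ = 0.1, iters = 10)
        Ξ = Θ \ dX                          # ordinary least squares to start
        for _ in 1:iters
            small = abs.(Ξ) .< λ            # threshold tiny coefficients to zero
            Ξ[small] .= 0
            for j in 1:size(dX, 2)          # refit the surviving terms, column by column
                big = .!small[:, j]
                Ξ[big, j] = Θ[:, big] \ dX[:, j]
            end
        end
        return Ξ
    end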


"The real promise of these methods is to use the universal approximator power of NNs...", still if one is to use a grey-box non-linear model dx/dt = F(x, u, t), why use NNs to characterize F? I would be more comfortable using a polynomial to characterize non-linearity than a "deep" black-box.

Polynomials are much easier to "train" because it is just one linear regression with no iteration. It has also been hinted that NNs are in essence polynomial regressions [0]. Furthermore, most activation functions are based on e^x, and the actual implementation of e^x in a computer is again a polynomial!

[0] https://arxiv.org/abs/1806.06850
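To illustrate the e^x point (with the caveat, raised downthread, that real libm implementations use range reduction plus minimax polynomials or tables rather than a plain Taylor series): a truncated series evaluated with Horner's rule is just a polynomial.

    # e^x ≈ 1 + x + x^2/2! + ... + x^n/n!, evaluated Horner-style
    function exp_taylor(x, n = 15)
        acc = 1.0
        for k in n:-1:1
            acc = 1.0 + x * acc / k
        end
        return acc
    end

    exp_taylor(1.0), exp(1.0)   # both ≈ 2.718281828...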


Gradient descent is already about as easy a training method as can be. Just a little freshman calculus and programmers can do the "state of the art" optimization of modern times. It's also scalable. If your polynomial regression gets too large because of the model complexity (for comparison, typical deep networks can have millions of parameters), you can't invert your matrix and will probably end up using a similar iterative method anyway.
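For instance, here is gradient descent for a least-squares loss written out with nothing but that freshman calculus (a toy sketch; the step size η has to be small enough for the problem's scaling):

    # minimise L(w) = ||X*w - y||^2 / m by following the gradient 2*X'*(X*w - y)/m
    function gd_lsq(X, y; η = 0.01, steps = 10_000)
        m, n = size(X)
        w = zeros(n)
        for _ in 1:steps
            grad = 2 .* (X' * (X * w .- y)) ./ m
            w .-= η .* grad
        end
        return w
    end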

I would have thought a computer uses tables to compute e^x. There are also piecewise-linear activation functions whose gradients are trivially easy to compute.

The whole "universal approximation" perspective is pretty vague to begin with. I'd say generally people don't understand why NN's work as well as they do. Previously theorists expected they would need a lot more training data to work, given their complexity. So it's driven to a large degree by empirical success. I am certainly really interested to see people accomplishing the same things with less sophisticated methods, since there is no doubt it has been overused/hyped in some areas just to make the papers and proposals sexier.


> The whole "universal approximation" perspective is pretty vague to begin with

Multiple times this. This claim gets trotted out frequently to showcase the superiority of NNs.

At best this is a red herring; at worst it is dishonest. The problem is that NNs aren't the only universal approximators. There is a whole slew of them: nearest-neighbour approximators, polynomials, rational splines, kernel methods… Furthermore, the universal approximation property only holds under certain conditions.

Finally, the ability to represent a function arbitrarily well (the approximation property) does not mean that one will be able to find the representation from data easily (the learning property). Empirical evidence suggests that, among the class of universal approximators we know, NNs seem easy to train effectively. Why this is so is not well understood.


Wavelets, sums of exponentials, Fourier bases, ... I just mentioned polynomials because they are the easiest. But people just jump on the NN bandwagon to get attention. The truth is that it is just another tool, and a good engineer has to choose the best tool from the toolbox and not just pick the hammer every time.


For reference, the DiffEqFlux library has a bunch of classical basis layers [1] and ways to tensor product them [2] for this reason. The real answer as to when to use a neural network is quite complicated [3], but in summary the results all point to the fact that for approximating an R^n -> R^m function, one only needs polynomially many parameters in order to do it well (as proven in a few cases, like in that linked paper, for "any case where Monte Carlo algorithms are not exponential in dimension").

Tensor products of classical basis functions have to cover every combination of terms (sin(ix)*sin(jy)), so they naturally grow like p^n if you have p parameters in each dimension. This exponential parameter growth is the curse of dimensionality, and the polynomial growth is the formal way of describing how neural networks overcome it.

So what is useful can depend on a number of factors (another property is the isotropy of the function you're trying to approximate), but this asymptotic property is what makes neural networks a good tool in the high-dimensional world where they are commonly used. That makes them quite good as well for things like feedback controllers of larger ODE systems. But yes, in smaller-dimensional cases a Fourier basis and such are good choices.

    [1] https://diffeqflux.sciml.ai/dev/layers/BasisLayers/
    [2] https://diffeqflux.sciml.ai/dev/layers/TensorLayer/
    [3] https://arxiv.org/abs/1908.10828
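A quick back-of-the-envelope version of that parameter-count argument (illustrative numbers only: a tensor-product basis with p functions per dimension needs p^d coefficients, while a fixed-width MLP grows only linearly with the input dimension d):

    # coefficients for a tensor product of p basis functions in each of d dimensions
    tensor_params(p, d) = p^d

    # parameters of a d -> w -> w -> 1 multilayer perceptron (weights + biases)
    mlp_params(d, w = 32) = (d*w + w) + (w*w + w) + (w*1 + 1)

    for d in (2, 4, 8, 16)
        println("d = $d:  tensor basis $(tensor_params(10, d))  vs  MLP $(mlp_params(d))")
    end

The gap explodes quickly as d grows, which is the curse of dimensionality in numbers.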


Fitting a sum of N exponentials is also a linear problem with no iterations: https://math.stackexchange.com/questions/1428566/fit-sum-of-...
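One standard way to do this (Prony's method, for uniformly sampled data; not necessarily the exact approach in that link) solves a linear prediction problem for the exponents and then one more linear solve for the amplitudes. A rough sketch, with `prony` a hypothetical helper name:

    using LinearAlgebra

    # y: samples of Σ c_i * exp(a_i * t) at t_k = (k-1)*dt; N: number of exponentials
    function prony(y, dt, N)
        M = length(y)
        # 1) linear prediction: y[k+N] = -(α_1*y[k+N-1] + ... + α_N*y[k])
        A = [y[r + N - j] for r in 1:M-N, j in 1:N]
        α = A \ (-y[N+1:M])
        # 2) roots of z^N + α_1*z^(N-1) + ... + α_N via the companion matrix
        C = diagm(-1 => ones(N - 1))
        C[1, :] .= -α
        z = eigvals(C)
        a = log.(Complex.(z)) ./ dt            # continuous-time exponents
        # 3) amplitudes: one more linear least squares against z_i^(k-1)
        V = [zk^(k - 1) for k in 1:M, zk in z]
        c = V \ Complex.(y)
        return a, c
    end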



Yes! Thank you, Sir! I can see you know what you are talking about. This is my point: NNs are very useful for some problems; for others they are not worth the complexity and black-box nature.


For polynomial regression of the type y = p0 + p1x + p2x^2 + ... + pnx^n, the "training" algorithm is linear least squares (no need for gradient descent). Assuming you have data (y, x), the explicit least squares solution is P = pinv(X) Y, see:

y = [1, x, x^2, ... x^n][p0, p1, p2, ..., pn]^T = XP

XP = y

(X^T X)P = (X^T) y

P = (X^T * X)^-1 * (X^T) * y

(X^T * X)^-1 * (X^T) is called the pseudo-inverse of X (which contains all your data). No need for iterations. A similar solution is found for multi-variable polynomials, i.e. y = f(x1, x2, ..., xm) where f is a polynomial containing combinations of the independent variables x1, ..., xm and their powers.
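In code the whole "training" step is a couple of lines (a sketch; in practice one uses a least-squares solve, which factorises X rather than forming (X^T X)^-1 explicitly, but for full-rank X it is the same solution as pinv(X)*y):

    # fit y ≈ p0 + p1*x + ... + pn*x^n in one linear solve
    function polyfit(x, y, n)
        X = [xi^j for xi in x, j in 0:n]   # Vandermonde matrix, one row per sample
        return X \ y                        # least-squares solution
    end

    # p = polyfit(xdata, ydata, 3);  yhat = [xi^j for xi in xdata, j in 0:3] * p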


> ... (X^T * X)^-1 ...

This is the matrix inversion I was referring to. Its size (at best) depends on the smaller of the number of parameters and the number of training samples. Both get very big in machine learning. When this happens you need to use some kind of low-memory iterative method like Greville's algorithm, or even gradient descent itself. So you're ultimately not any better off.


In practice one computes (X^T * X)^-1 * (X^T) in one go using the Singular Value Decomposition, for which very efficient algorithms exist. But if there is really a lot of data, then recursive linear least squares can be used to partition the larger least-squares problem into smaller pieces. But then again, you just make one pass over the data, not multiple passes as with gradient descent.
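For reference, recursive least squares processes one sample at a time with a rank-one update instead of refactorising everything. A minimal sketch without a forgetting factor (names here are just illustrative):

    using LinearAlgebra

    # one RLS update: x is the regressor row (e.g. [1, xk, xk^2, ...]), y the target
    function rls_update!(θ, P, x, y)
        Px = P * x
        k  = Px ./ (1 + dot(x, Px))        # gain vector
        θ .+= k .* (y - dot(x, θ))         # correct the estimate with the new residual
        P .-= k * Px'                      # update the (scaled) covariance
        return θ, P
    end

    # start from θ = zeros(n), P = Matrix(1e6I, n, n) and call rls_update! once per sample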


Interestingly (another empirical result that's poorly understood), with stochastic gradient descent convergence often requires only one pass through the data; if not, it takes only a small number of passes.

And yes this field only exists because we are presuming a really large amount of data, which often can't even fit on the same hard drive. And a really complex model.

Older kernel methods basically do what you want for tractable datasets. They can handle very high-order polynomials, and also add the ability to regularize the solution in various ways. Though again, I would be interested in seeing those methods compared to a simple least-squares fit as you propose, which people often didn't do even back when kernel methods were all the rage.
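For comparison, a kernel (ridge) regression fit is also a single linear solve once the Gram matrix is formed. A toy sketch with a Gaussian/RBF kernel (function names are illustrative):

    using LinearAlgebra

    rbf(x, z; γ = 1.0) = exp(-γ * sum(abs2, x .- z))

    # X: vector of training inputs, y: targets, λ: ridge regularisation
    function krr_fit(X, y; λ = 1e-3, γ = 1.0)
        K = [rbf(xi, xj; γ = γ) for xi in X, xj in X]   # Gram matrix
        return (K + λ * I) \ y                          # dual coefficients α
    end

    krr_predict(α, X, x; γ = 1.0) = sum(α[i] * rbf(X[i], x; γ = γ) for i in eachindex(X))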


Polynomials (or rather multinomials) suffer badly from the curse of dimensionality when more terms are needed (look at how Taylor series terms explode). Neural networks do better. The fact that a neural network is computed using polynomials is irrelevant, since the way the NN is parametrized is different from a sum over a basis of polynomials. You can inspect the vector field to prove certain properties of the neural network. SINDy is already mentioned in another reply.


Yeah, this is the crux. Here's a comment from one of the devs when I asked about the polynomial vs NN basis:

The answer is quite simple really. Classical basis functions suffer from the curse of dimensionality because if you tensor product polynomial basis functions or things like Fourier basis, with N basis functions in each direction, then you have N^d parameters that are required in order to handle every combination `sin(x) + sin(2x) + ... + sin(y) + sin(2y) + ... + sin(x)sin(y) + sin(2x)sin(y) + ....`

Neural networks only grow polynomially with dimension, so at around 8 dimensions they become more efficient. In fact, this is why we have https://diffeqflux.sciml.ai/dev/layers/BasisLayers/


Polynomials are just an example, the easiest one. The point is that there are many more universal approximators (as some other user commented here), many of them much more suitable for control applications than NNs.


I'm not entirely confident in answering that directly, so perhaps you can check my reasoning here.

If F is completely unknown, perhaps you start training with a 10-dimensional polynomial basis. What is the (computational) cost of obtaining your solution? Once you have it, will this polynomial accurately represent your system in any real-world manner? Perhaps higher-order terms are needed to approximate trigonometric functions - are you able to easily add such functions to your training basis? If not, then your basis could be too restrictive to provide you with a minimal implementation of your control variable.

It looks like you work with this stuff far more than I have, so perhaps that's not an adequate answer.

Another way to look at this though: if you only wanted to characterise your system with polynomials, UODEs + SINDy can do this for you - the NN is simply the intermediate approximation step, standing in for any other parametrisation.


The computational cost of "training" a polynomial would be about the same as just one iteration of the training algorithm used by typical NNs. When it comes to trig functions, the story is the same as with the exp function: when you call sin() or cos() in your favorite language, in the end it uses Taylor series (polynomials) to compute the result (plus some hacks to add precision on certain ranges of the function and to overcome some floating-point precision limitations).

How well a polynomial model fits the real-world system has to be validated against data, just the same as with NNs. What does one do when an NN fit is not good enough, or too good (overfitting)? One adds or removes layers. Same with polynomials: one increases or decreases the degree.

Sorry for the rant, I am not saying NNs are useless, because I do believe they are super useful for certain problems, especially for categorization. But it seems to me that nowadays there is this trend of using NNs as a hammer, and not all problems are nails. Especially when it comes to control, where lives or big economic losses are at stake, it is the responsibility of the engineer to resist the hype and use the right tool for the problem.


There is a saying in mathematics that the fastest way to a solution is through the complex plane. This was discovered because a lot of proofs are nicer via analytic continuation and analyzing the properties of the continued function. Complex-step differentiation is another example of this.

In some sense, something similar applies to neural networks in this context. Have you done a lot of fitting of classical basis methods inside of differential equations? They are very prone to local minima, so direct training of polynomials inside of a differential equation is rather hard. But neural networks behave much better here, for reasons somehow related to [1], which essentially states that the local minima are close to the global minimum on large enough neural networks. So this lets you get pretty lazy and just do local optimization to find missing functions, and then sparsify to polynomials later, in a way where the optimization is better behaved than going directly to polynomials. The DiffEqFlux library has both approaches available, so you can try both side by side and see the difference. From years of experience doing the former, the latter is quite a breath of fresh air.

   [1] https://arxiv.org/abs/1412.0233
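A rough sketch of that workflow, in the spirit of the DiffEqFlux UDE examples linked upthread (API names as of 2020-era DiffEqFlux; `u0`, `tdata` and `data` are placeholders for a two-state system's initial condition, sample times and measurements): embed the NN as the unknown term, fit it by local optimisation against the data, and sparsify afterwards.

    using OrdinaryDiffEq, DiffEqFlux, Flux

    nn = FastChain(FastDense(2, 16, tanh), FastDense(16, 2))
    p0 = initial_params(nn)

    # known linear decay plus an unknown interaction term represented by the NN
    function ude!(du, u, p, t)
        du .= -0.1 .* u .+ nn(u, p)
    end

    prob = ODEProblem(ude!, u0, (tdata[1], tdata[end]), p0)

    loss(p) = sum(abs2, Array(solve(prob, Tsit5(); p = p, saveat = tdata)) .- data)

    res = DiffEqFlux.sciml_train(loss, p0, ADAM(0.01); maxiters = 300)
    # res.minimizer holds the trained parameters; SINDy/sparse regression comes afterwards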


> The computational cost of "training" a polynomial would be the same as just one iteration of the training algorithm used by typical NNs.

That statement depends heavily on the dimensionality of the problem. Polynomials also have huge problems with discontinuities (even ones only in some higher-order derivative) and can require an infinite number of polynomial terms to smooth out the errors around the discontinuities (try to fit the integral of |x| with polynomials).

Fear of NNs in control is justified when the networks are poorly understood.


Not just that, they tend to blow up when one extrapolates 'too' far from the data. This can be controlled for by using other basis functions, for example functions in a reproducing kernel Hilbert space or radial basis functions. It is best to choose the basis based upon the data (as RBFs and RKHS bases do) and not choose a basis independent of the data. This applies to polynomials too: choosing a polynomial basis that's orthogonal with respect to the data distribution makes computations much better behaved -- otherwise it's common to run into ill-conditioned problems that are very sensitive to noise in the data.
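The conditioning point is easy to see numerically: compare the condition number of a raw monomial design matrix with one built from Chebyshev polynomials (used here as a stand-in for a basis adapted to the data) on the same interval.

    using LinearAlgebra

    # Chebyshev polynomials T_0..T_n via the recurrence T_{k+1} = 2x*T_k - T_{k-1}
    function cheb_design(x, n)
        T = ones(length(x), n + 1)
        n ≥ 1 && (T[:, 2] .= x)
        for k in 3:n+1
            T[:, k] .= 2 .* x .* T[:, k-1] .- T[:, k-2]
        end
        return T
    end

    x = range(-1, 1, length = 200)
    V = [xi^j for xi in x, j in 0:15]   # raw monomial basis
    C = cheb_design(x, 15)
    @show cond(V) cond(C)               # the monomial Vandermonde is far worse conditioned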


I certainly agree with the 'NNs are used as hammers' point. Until coming across the UODE concept I was of the opinion they were more parlour trick than anything useful. Here, though, I can see some validity.

These comments are appreciated - I think a discussion like this is lacking in the SciML docs (or at least not visible enough). Will have a chat with some of the devs and see if there's something we can add.


I assume you are talking about the pendulum. PID tuning is fundamentally restricted to holding set points and can't deal with the non-linearities involved in reaching those set points.

MPC, if solved over too small a time horizon, might fail to find a swing-up trajectory, but that isn't a fundamental restriction. However, I would consider MPC overkill for this, as MPC is usually a series of optimal control problems.

By optimal control I assume you mean approaches where you optimize a u() so that it minimizes some integral. This approach does just that, only that u is parametrized not by polynomials/splines but by a neural network, taking state, time and other things into account.



