Differentiable programming from scratch (thenumb.at)
196 points by TheNumbat on Aug 1, 2022 | hide | past | favorite | 107 comments



One thing that I've thought about: when using calculus in programs, you're often dealing with a very small (but finite) Δx, rather than an infinitesimal 𝛿x.

And I don't think that the standard limit definition of the derivative is the ideal form when dealing with that situation.

(Take for example g(x) = -f(-x): the forward-difference gradients of g at x and of f at -x are not equivalent for finite Δx, but would be for infinitesimal 𝛿x.)

Anyway,

    limit(h -> 0)((f(x + h) - f(x))/h)
Why do we even need h? It's just (the infinitesimal) 𝛿x multiplied by an arbitrary finite constant.

(You're dealing with a linear equivalent infinitesimal subsection of the graph - so 𝛿x and 𝛿y scale linearly with each other.)

So remove h:

    ((f(x + 𝛿x) - f(x))/𝛿x)
Looks cleaner.

    (f(x + 𝛿x) - f(x))/𝛿x ≡ (f(x) - f(x - 𝛿x))/𝛿x ≡ (f(x + 𝛿x) - f(x - 𝛿x))/(2 * 𝛿x)
You can take the gradient from x to (x + 𝛿x) or from (x - 𝛿x) to x, or from (x - 𝛿x) to (x + 𝛿x).

(The cube root of the product of the other three forms is also a valid definition of the derivative, if you want to be evil about it.)

    (f(x + 𝛿x) - f(x - 𝛿x))/(2 * 𝛿x)
(That's a nicer form for programmatic use; resolving the earlier mentioned gradient issue when dealing with finite deltas.)

Another thing I noticed: the standard (h -> 0) form eliminates all parts of the gradient containing 𝛿x values - which is fine for infinitesimal 𝛿x, but is less ideal for finite Δx.
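The difference between the one-sided and symmetric forms is easy to see numerically; a minimal sketch (function, point, and step size chosen arbitrarily for illustration):

```python
import math

def forward_diff(f, x, h):
    # one-sided form: (f(x + h) - f(x)) / h
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    # symmetric form: (f(x + h) - f(x - h)) / (2h)
    return (f(x + h) - f(x - h)) / (2 * h)

x, h = 1.0, 1e-4
exact = math.cos(x)  # d/dx sin(x) = cos(x)
err_forward = abs(forward_diff(math.sin, x, h) - exact)
err_central = abs(central_diff(math.sin, x, h) - exact)
print(err_forward, err_central)  # the central error is orders of magnitude smaller
```

The forward error shrinks like h while the central error shrinks like h², which is why the symmetric form behaves better for finite Δx.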


You've really just derived the central difference. Yes, this estimate is more accurate. The issue is that you can't use it at the edges of a sampled domain; at the boundaries you're stuck with lower-precision one-sided differences.

Here's a resource you might be interested in http://www2.math.umd.edu/~dlevy/classes/amsc466/lecture-note...


The central difference is equal to Im(f(x + j h)) / h where j is the split-complex number which squares to 1, "Im" is the imaginary-part function, and h is a small real number. When j is replaced with the dual number epsilon, this instead equals the actual derivative, as is shown in the article. You can also replace j with the complex number i, producing the complex-step method.

Central difference and complex-step are both bad approximations to the actual derivative. Of these, the complex-step method is the most ridiculous because it requires extending a function to act on complex numbers, while not extending it to the closely-related dual numbers.


"Ridiculous" is not a word I'd use... sometimes symbolic derivatives are intractable, in which case complex step derivatives provide high-precision numerical derivatives which are otherwise unattainable.

https://vladfeinberg.com/2020/07/05/discrete-residues.html


No. The dual numbers are very similar to the complex numbers and produce exact results. Ridiculous is true and fair.

[edit] Also, I wouldn't count using Cauchy's Integral Formula as the complex-step method.


This!

I always find it curious how computer scientists end up rediscovering things known for decades, and then get all hyped about it ...


I figure it's because math people aren't great at communicating with people outside their domain. Case in point: Wikipedia articles on math being absolute gobbledygook to non-math people.


To be fair, I can't get computer scientists to read any math books. But I don't think this is particularly odd. The only people I know that go out and read math books are mostly (current or former) mathematicians or physicists (with a random sprinkling of others). I'm not sure it is exactly mathematicians' fault that your third-grade teacher impressed upon you that math is hard and useless.


I don't think math is hard or useless, thank you very much. However if I'm programming a game and want to project a point onto a line I'd prefer to not get told by math folks that I first need to learn all of linear algebra before they can get their solution explained to me. I'm not stupid, I understand plenty of math, I'm just not steeped in the jargon.


I feel like you're responding as if I attacked you. I'm not sure why. Maybe it was the comment about not being able to get CS people to read math books but you do? I don't mean that statement in an absolute way (there's no absolutes) but rather as a general case. Of course there are CS people that read math books. It is just my experience that the majority aren't interested in it and I don't think this is an uncommon experience. If you're different, good for you, but don't take general statements personally. I'm not calling you stupid nor am I saying you don't have an interest in math. You're an individual person.


"I'm not sure it is exactly mathematicians faults that your third grade teacher impressed upon you that math is hard and useless." doesn't exactly strike me as a general statement, but perhaps I overreacted.


If you're interested in how to best calculate derivatives numerically using a finite difference rather than an infinitesimal, try having a look at the Wikipedia page on numerical differentiation. The punchline is that you can control the error in f'(x) much better if you use many displacements, ie. f'(x) = a1 f(x + b1 h) + a2 f(x + b2 h) + a3 f(x + b3 h)..., where the bi are chosen and the ai depend on the bi and h.
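As a concrete instance (a standard textbook stencil, not derived here), the five-point formula uses the four displacements b = ±1, ±2:

```python
import math

def five_point(f, x, h):
    # f'(x) ≈ (-f(x+2h) + 8f(x+h) - 8f(x-h) + f(x-2h)) / (12h), error O(h^4)
    return (-f(x + 2*h) + 8*f(x + h) - 8*f(x - h) + f(x - 2*h)) / (12 * h)

x, h = 1.0, 1e-2
# compare against the plain central difference at the same step size
err5 = abs(five_point(math.exp, x, h) - math.exp(x))
err2 = abs((math.exp(x + h) - math.exp(x - h)) / (2 * h) - math.exp(x))
print(err5, err2)  # the five-point estimate is far more accurate for the same h
```

Note the coefficients sum to zero, as they must for any derivative stencil.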


Such methods are doomed. Consider the function

  f(x) = a sin(x / a)
for very small a. Finite differencing methods would approximate f'(0) as 0 as f(x) is very close to 0, but the actual value of f'(0) is 1.

Differentiation is a discontinuous operator, and cannot be accurately estimated using numerical methods. Only exact methods, which include the dual numbers, can produce accurate results.
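A quick numerical check of this claim (the values of a and h here are illustrative; the point is that h is much larger than a):

```python
import math

a = 1e-8
f = lambda x: a * math.sin(x / a)   # f'(0) = cos(0) = 1 exactly

h = 1e-4                             # any step h >> a gives garbage
estimate = (f(h) - f(-h)) / (2 * h)  # central difference
print(estimate)                      # nowhere near the true value of 1
```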


Finite difference derivatives are the basis for a range of numerical methods for solving differential equations. Given that these techniques underpin various areas of modern engineering, I'd say your statement that 'differentiation [...] cannot be accurately estimated using numerical methods' is demonstrably incorrect. I think you're fishing for something like 'there exist functions whose derivatives cannot be accurately estimated numerically'. Presumably some assumptions have been made which rule out such functions in using finite difference schemes in practical applications.


> Finite difference derivatives are the basis for a range of numerical methods for solving differential equations.

I thought the basis usually involved rewriting the equation in terms of integrals, which can be estimated using quadrature. Numerical quadrature does not have the same problems as finite differencing because integration is a continuous operator. How accurate are these methods you're proposing? Could they be improved using autodiff (via the dual numbers or via reverse-mode autodiff)? Do you have a reference, so I can understand your point better? I'm sceptical and I'm not inclined to believe you yet.


Well, there are many different methods, and certainly some are based on numerical integration as you say. The simplest method based on finite difference derivatives is Euler's method, which uses the scheme given by OP, with error O(h). This can be extended to Runge-Kutta methods, which use several function evaluations per step to achieve higher-order error (e.g. O(h^4) for the classic fourth-order method), and which to my knowledge are common and widely used to solve ODEs. You can read about Runge-Kutta methods on Wikipedia, and they also have a page on numerical methods for ODEs which gives an overview of different methods.

I don't know what autodiff is, or how the dual numbers fit in here. I'd be interested to learn if you want to explain


> I don't know what autodiff is, or how the dual numbers fit in here. I'd be interested to learn if you want to explain

Autodiff is an algorithm for computing derivatives in areas like Deep Learning. It's an exact method with two main variants -- forward mode and reverse mode -- differing in terms of when each of them is more efficient. It's implemented in widely-used numerical libraries like PyTorch and TensorFlow. The simplest implementations use the dual numbers to implement the forward mode variant. More sophisticated implementations implement the reverse mode variant. Wikipedia has plenty on it.


Ok I had a look at wiki and it's clear to me how the dual numbers encode differentiation and the theory behind how autodiff works. How do the two fit together? I didn't follow that part


You define a dual number type in your programming language of choice, and overload the operators +, -, *, /, <, >, sin, cos, etc. Then you take a function which computes a numerical function f which is defined using the previous operations, and simply evaluate them on a dual number. It's "automatic" because, assuming the right programming language, a user-defined numerical function will often be able to take in many different kinds of numbers, like complex numbers, real numbers, arbitrary precision floats, and even user-defined classes like the dual numbers.
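A minimal sketch of such a dual number type in Python (hypothetical names, and only a handful of operations implemented; enough for a small demo):

```python
import math

class Dual:
    """Represents val + eps·ε with ε² = 0; carries a value and its derivative."""
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.eps + o.eps)
    __radd__ = __add__

    def __sub__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val - o.val, self.eps - o.eps)

    def __mul__(self, o):
        # product rule: (a + a'ε)(b + b'ε) = ab + (ab' + a'b)ε
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.val * o.eps + self.eps * o.val)
    __rmul__ = __mul__

    def __truediv__(self, o):
        # quotient rule
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val / o.val,
                    (self.eps * o.val - self.val * o.eps) / (o.val ** 2))

def dsin(x):
    # chain rule: sin(a + a'ε) = sin(a) + cos(a)·a'·ε
    return Dual(math.sin(x.val), math.cos(x.val) * x.eps)

# f(x) = x·sin(x), so f'(x) = sin(x) + x·cos(x)
x = Dual(2.0, 1.0)   # seed the input's derivative with 1
y = x * dsin(x)
print(y.val, y.eps)
```

Seeding the input's epsilon with 1 makes the epsilon part of the output the derivative of the whole composition, via the product and chain rules baked into the operators.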


Ah, thank you, it was so neat and automatic that I missed the punchline. It looks like you'll always calculate f and f' together, and it's not possible to calculate only f', though? Seems like this could waste time in some applications?

I can see how this is the solution to the problem of calculating the numerical derivative of a known function composed of finite combinations of rational functions and a set of specified functions, elementary or special. I don't see how it's relevant to solving ODEs numerically though, given that in that case you don't know the functional form of f or f'


> I can see how this is the solution to the problem of calculating the numerical derivative of a known function composed of finite combinations of rational functions and a set of specified functions, elementary or special.

Yes and no. It's worth taking into account that you can apply it to some functions defined using conditions and loops. For instance, if f is the sqrt function defined using Newton's method, then autodiff via the dual numbers will compute an estimate for f'(x) = 1/(2 * sqrt(x)). More explicitly:

  def f(a):
    x = 1.0
    oldx = 2.0
    threshold = 1e-10
    # loop while successive estimates still differ; the comparison must be
    # >, not <, or the loop would exit immediately
    while abs(x - oldx) > threshold:
      oldx, x = x, (x + a/x) / 2
    return x
^ You can evaluate f on a dual number. This algorithm contains conditions and loops.
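To make this concrete, here's a hypothetical self-contained version: a stripped-down dual type with just the operations the loop needs (the convergence test compares only the real parts, and the loop runs while the change is still above the threshold):

```python
class Dual:
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.eps + o.eps)
    def __truediv__(self, o):
        # quotient rule
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val / o.val,
                    (self.eps * o.val - self.val * o.eps) / (o.val ** 2))

def f(a):
    # Newton's method for sqrt(a), evaluated on dual numbers
    x, oldx = Dual(1.0), Dual(2.0)
    while abs(x.val - oldx.val) > 1e-10:
        oldx, x = x, (x + a / x) / Dual(2.0)
    return x

r = f(Dual(4.0, 1.0))   # seed the derivative of the input with 1
print(r.val, r.eps)     # sqrt(4) = 2, and d/da sqrt(a) at a=4 is 1/(2·2) = 0.25
```

The derivative falls out even though the function is defined by an iterative loop rather than a closed-form expression.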


This was an interesting comment and I learnt something new, thank you. It does feel a little bit like you are proselytising the good news of the dual number autodiff method though, when you don't respond to questions on potential downsides and only add information about positives.


I'm more familiar with using differentiation in optimisation, where autodiff seems like the best method. I'm also aware of the fact that the differentiation operator is discontinuous, which makes me sceptical of finite-differencing methods. I'm not familiar with numerically solving differential equations, but the natural connection between Runge-Kutta and finite differencing seems to be an interesting one.


This is not true. Doomed is sensational. With proper perturbation analysis, one can choose an appropriate stencil and formula that has desired error bounds.

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72...

https://www.google.com/search?q=finite+difference+stencil+de...


Won't work on my example. Given any differentiable function f:R -> R, you can add a very small function to it that dramatically changes its derivative. But maybe this is too pathological to bother about.

Over the complex numbers, this phenomenon doesn't happen. This can be shown, for instance, using the maximum modulus principle. You can therefore use Cauchy's Integral Formula to accurately estimate derivatives. But why not then just use the dual numbers?


I see your edit - fair point, analyticity does make complex functions much more well-behaved than their real counterparts. Still, this isn't a death knell like you previously claimed.

Let's put up a bet. I win the bet if I provide a polynomial approximation of f'(x), R[f(x)], bounded by a maximum of 5% error, on the interval (-1, 1), by November 1st, 2022. If I fail to, or resign by then, you win.

If I win, you agree to silence yourself for 1 year, regarding the split-complex and dual numbers, no longer touting them as a panacea.

If I lose, I'll agree to make a YouTube video extolling the virtues of a mathematical subject of your choice, such as these types of numbers.


I don't fully understand your bet (and I don't take kindly to being asked to silence myself). I don't have time at the moment to craft an example. I might play around a bit with the edges of floating point. Differentiation is indeed discontinuous, so with enough determination I should be able to win such a bet. My function would probably be a sigmoid function which would go from -a to +a (for small a) over a very small interval [-b,b], where b can equal a.

> If I win, you agree to silence yourself for 1 year, regarding the split-complex and dual numbers, no longer touting them as a panacea.

I never said these numbers were a panacea. I said the dual numbers were an exact method. In this comment [1], which is what you're referring to, I said that the connection with other planar algebras was cute, but you mischaracterised what I said, which was rude. The algebras are pertinent because the article mentioned the dual numbers.

[1] - https://news.ycombinator.com/item?id=32305644


Fair enough. The bet I had in mind was the function you'd already stated as intractable: f(x) = a sin(x / a), the value of a is up to you, within reason (no IEEE fp shenanigans.)

If you'd rather some other consequence than silencing, please suggest one. I am tired of hearing your panaceas. Symbolic differentiation is also exact but intractable. What makes the duals yield exactness in any way that isn't prone to the same finitude as other approximations?

I see a lot of beasts but only one individual claiming to tame them with ease and exactness.

If it comes across as rude, that's not my intent. Moreso to have some clarity and to have your claims be validated.


I'm not going to accept your bet because I don't understand its conditions (and what's the point?), but I accept your challenge. Also, I'm going to f*** around bigly with floating point.

  def f(x,a):
    return 1 + a * sin(x / a)
Now let epsilon = 2.2204460492503133e-36. Consider f(x, epsilon).

The function should always return 1 in floating point, so any finite differencing method will estimate its derivative as 0. But over the dual numbers, the function should have f(e, epsilon) = 1 + e, where e is the dual number imaginary. In other words, the dual numbers return the correct value of the derivative at x=0, which is 1. You should be able to use any Python implementation of the dual numbers; my own one uses Sympy. Actually, you might be able to use Scipy and represent dual numbers as matrices.

Look up automatic differentiation, because you're not aware of it, and it is the subject of the article. You seem to misunderstand symbolic methods. The problem of finding an exact derivative at a single point is not as hard as finding the symbolic derivative everywhere. You're referring to the product rule and chain rule, but these are not needed to find derivatives at single points.


I am aware of AD. All AD does is use the chain rule on the Taylor (or otherwise) approximations of all arithmetic functions used in the calculation of the target function. It's symbolic but only at the single arithmetic operation level. The chain rule is concealed under rules on epsilon. The reason epsilon's square equals 0 is to eliminate any higher order derivatives as soon as they appear. If the derivative is abnormal in a way such that the normal truncated calculative formulas would conceal it, then AD would fail to give an accurate result as well.


> If the derivative is abnormal in a way such that the normal truncated calculative formulas would conceal it, then AD would fail to give an accurate result as well.

The following is proof of otherwise. First implement the dual numbers by borrowing the code from here [1]. Then define the sine function in a crude way:

  from math import factorial
  
  def sin(x):
      return sum((-1)**i * x**(2*i + 1) / factorial(2*i + 1) for i in range(100))
We then define f:

  def f(x, a=1e-36):
      return 1 + a * sin(x / a)
and use the dual numbers to estimate the derivative at 0:

  In [49]: print(f(Dual(0,{'epsilon':1})))
  f = 1.0
  fepsilon = 1.0
We get 1 as the estimate, which is correct! For comparison, we may try using finite differencing to estimate the derivative at 0:

  In [50]: h = 1e-36; (f(h) - f(0))/h
  Out[50]: 0.0
The estimate for the derivative is 0 here, which is wrong. The finite differencing cannot be improved using stencil theory because the estimate of f(x) is always exactly 1 under floating point, and therefore the estimate for f'(x) is always exactly 0. Dual numbers win.

[1] - https://github.com/ujjwalkhandelwal/Dual-numbers-and-automat...


You've just described the dual numbers, which provide a way of implementing forward-mode autodiff.

  > If the derivative is abnormal in a way such that the normal
  > truncated calculative formulas would conceal it, then AD
  > would fail to give an accurate result as well.
What I'm seeing directly contradicts that. I've tested the example above (the one I called f(x,a) with a=1e-36, trying to find its derivative at x=0), and it gave the right answer with dual numbers, but not with finite differencing. So what you're saying isn't true. I'll post a minimal implementation here later.

[edit]

I've used Sympy out of laziness, which is inelegant. I don't know how clear this will be.

First of all, I have to change the number 1 in the definition of f(x,a) into a matrix. We need this because we'll be representing dual numbers as matrices, and you can't add the scalar 1 to a matrix:

  one = eye(2)
  
  def f(x, a=1e-36):
    return one + a * sin(x / a)
Sympy doesn't have its own implementation of `sin`, so we'll need to provide one:

  def sin(M):
    return im(exp(I*M).n())

At the dual number ε, we have f(ε) = f(0) + ε f'(0). The dual number ε will be represented as the matrix

  ⎡0  1⎤
  ⎢    ⎥
  ⎣0  0⎦
As code:

  epsilon = Matrix([[0,1],[0,0]])
Evaluating f on the above matrix gives:

  In: f(epsilon)
  Out: 
  ⎡1  1.0⎤
  ⎢      ⎥
  ⎣0   1 ⎦
Which tells us that f(0) ≈ 1 and f'(0) ≈ 1. Both are correct.

If we use finite differencing instead, we get:

  In: h = 1e-36; (f(h * eye(2)) - f(0 * eye(2)))/h
  Out: 
  ⎡0  0⎤
  ⎢    ⎥
  ⎣0  0⎦
So it's claiming incorrectly that the derivative at 0 is 0. No finite differencing scheme can fix this because floating point causes the function f to become constant.

Autodiff using the dual numbers has pulled off the seemingly impossible.

[edit: An early typo caused the code to produce an incorrect result. Now fixed, and my claim holds.]


> Autodiff using the dual numbers has pulled off the seemingly impossible.

It's not seemingly impossible, if you understand that the chain rule for f' is just being executed at the same time as the calculation of f, by having derivatives for basic operations already defined.

However, like I said, if you have a calculation method that hides derivatives in terms that have been truncated, then this will not save you.

(-1)**i * x**(2*i + 1) / factorial(2*i + 1) is not one of those methods -- and sin is rather regular in regard to its Taylor series. So of course it works out in this case, with a normal power series calculation method.

Try a different series or calculation method, and dual numbers will get you a wildly different result. Understand, dual numbers only work well, when you use a method of calculation that front-loads terms that have high bearing on the derivative. Otherwise the missing terms/truncation, causes severe inaccuracy.

However, stenciling might actually perform better in these scenarios.


> Try a different series or calculation method, and dual numbers will get you a wildly different result

No. The example works because:

  While the exact value of f(x,a) isn't 1, given any inexact representation of real numbers like floating point or fixed point, the value of "a" can be chosen so that f(x,a) has 1 as its closest representation.
Trying to compute f(x,a) differently isn't going to change that, so stencilling methods are never going to work here. But autodiff will always work. This means I win your challenge.

Your other claims are probably gibberish. You need to provide an example.


What don't you understand? The method of calculating the function is an approximation, thus the AD derivative is dependent on it: the AD derivative is a derivative of the approximation, not of the actual function. Whereas an approximation of the actual derivative is what we are truly after.

> The key take away here is that the map is not the territory. Most nontrivial functions on computers are implemented as some function that approximates (the map) the mathematical ideal (the territory). Automatic differentiation gives back a completely accurate derivative of that function (the map) doing the approximation. Furthermore, the accurate derivative of an approximation to the ideal (e.g. d_my_sin) is less accurate than an approximation to the (ideal) derivative of the ideal (e.g. my_cos). There is no truncation error in the work the AD did; but there is a truncation error in the sense that we are now using a more truncated approximation than we would write ourselves.

https://www.oxinabox.net/2021/02/08/AD-truncation-error.html

AD is great but if you have a calculation method ill-suited for AD, then you'll get shite results. Why is this surprising?

And yeah, stenciling is mostly for PDEs and other state spaces that we don't have a closed form for. It's generally not used for an analytic function. But you can use it for an analytic function if you tailor the stencil to the function. In fact, you'll just yield a truncated Taylor polynomial if you provide a perfect function-specific stencil.


An entire subfield of analysis called Stenciling exists, for this purpose. Depending on the function or differential equation / system, different stencils are used.

Stenciling doesn't just deal with the derivative but it tries to come up with approximations involving a fixed number of sample points, the stencil, for any differential operator, ie. the Laplacian, higher order derivatives, etc.

> The term "stencil" was coined for such patterns to reflect the concept of laying out a stencil in the usual sense over a computational grid to reveal just the numbers needed at a particular step. [2]

[0] https://en.wikipedia.org/wiki/Five-point_stencil

[1] https://en.m.wikipedia.org/wiki/Finite_difference_coefficien...

[2] https://en.wikipedia.org/wiki/Stencil_(numerical_analysis)
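For instance, the five-point stencil for the 2D Laplacian from [0] fits in a few lines (the test function is chosen so the answer is known exactly):

```python
def laplacian_5pt(f, x, y, h):
    # ∇²f ≈ (f(x±h, y) + f(x, y±h) - 4·f(x, y)) / h²
    return (f(x + h, y) + f(x - h, y) + f(x, y + h) + f(x, y - h)
            - 4 * f(x, y)) / h**2

f = lambda x, y: x**2 + y**2      # ∇²f = 4 everywhere
print(laplacian_5pt(f, 0.3, -0.7, 1e-3))
```

For a quadratic the stencil is exact up to rounding, since its truncation error involves only fourth and higher derivatives.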


Thanks for the links (and the flame war).

Another thing that I'm wondering about is how to calculate the length of a function.

What I'm thinking, is that you can treat a function like a piece of string, you take f(x) and generate a pair of functions, one above and one below f(x).

You generate the functions at a fixed (infinitesimal) tangential distance 𝛿t and generate deltas for {x, y}

    mul = 𝛿t/((𝛿x)^2 + (𝛿y)^2)^.5
    Δx = 𝛿y * mul
    Δy = 𝛿x * mul

    u(x - Δx) = f(x) + Δy
    l(x + Δx) = f(x) - Δy
The idea being that the function length is the area ∫(u(x) - l(x)) divided by the thickness 2 * 𝛿t.

I have no idea how wrong this is, but if you can point me toward any resources that'll help when I (inevitably) get stuck, it would be very much appreciated.
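One possible sanity check: for smooth f, the rope construction should converge (as 𝛿t → 0) to the standard arc length ∫√(1 + f'(x)²) dx, which can itself be approximated by summing straight segments. A sketch, using f(x) = x² where the arc length has a closed form:

```python
import math

def arc_length(f, a, b, n=100000):
    # sum the lengths of straight segments between sample points
    xs = [a + (b - a) * i / n for i in range(n + 1)]
    return sum(math.hypot(xs[i + 1] - xs[i], f(xs[i + 1]) - f(xs[i]))
               for i in range(n))

# f(x) = x² on [0, 1]; closed form: sqrt(5)/2 + asinh(2)/4
approx = arc_length(lambda x: x * x, 0.0, 1.0)
exact = math.sqrt(5) / 2 + math.asinh(2) / 4
print(approx, exact)
```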


Not exactly sure what you are getting at, but it does remind me of Cauchy's Integral Formula, which shows that differentiation is equivalent to integration.

https://en.wikipedia.org/wiki/Cauchy%27s_integral_formula


Trying to measure the length of a curve f(x).

I turn the curve into a fixed infinitesimal thickness (2 * 𝛿t) 'rope' and measure the area.

Then I divide the area by the thickness to get the length.

The rope is defined as the area between two curves, both of them manipulations of f(x).

The curves are generated by moving each point (in f(x)) a constant infinitesimal distance at a tangent to the function (this gives the rope its fixed thickness), one in either direction.

For the upper curve {Δx, Δy} = {-(n * 𝛿y), (n * 𝛿x)}, for the lower {(n * 𝛿y), -(n * 𝛿x)}, where n is a normalisation constant 𝛿t/((𝛿x)^2 + (𝛿y)^2)^.5

Once you have these two functions, you can use the integral of the difference to get the area.

That's what I'm thinking anyway.


I'm a brainlet or maybe it's the 10+ hours of work I did today but I just can't grasp it right now.

There are a lot of ways to take integrals though, so I wouldn't doubt your method could work. Generally, though, the hole in a lot of these integration methods is their inability to work with functions that are pathological: https://en.wikipedia.org/wiki/Pathological_(mathematics)

Types of integrals: Riemann, Lebesgue, Burkill, Bochner, Daniell, Darboux, Henstock-Kurzweil, Haar, Hellinger, Khinchin, Kolmogorov, Lebesgue-Stieltjes, Pettis, Pfeffer, Riemann-Stieltjes, and regulated integrals.

Integration techniques: substitution (trigonometric, Euler, Weierstrass), integration by parts, partial fractions, Euler's formula, inverse functions, changing the order of integration, reduction formulas, parametric derivatives, differentiation under the integral sign, Laplace transforms, contour integration, Laplace's method, numerical integration (e.g. Simpson's rule, the trapezoidal rule), and the Risch algorithm for symbolic integration.


(No idea if you'll ever see this, I also have no damned idea what I'm talking about.)

Looking at the Weierstrass function, it's a chaotically convergent series with a non-convergent gradient series.

(The standard differential forms don't apply, because the assumption of linearisation at the scale of 𝛿x is invalid, due to the scale invariant properties of the fractal.)

In that case the integral function would be even more convergent since the 'noise' (pretty much) adds to zero.

So can you differentiate via integration, and get a gradient for a related function (with less noise)?

I messed around with this, taking the areas of a pair of little triangle approximations and adding them to get a ~rectangle.

Then divide that area by 𝛿x to get 𝛿y, and divide that by 𝛿x to get 𝛿y/𝛿x.

This was the form that I reached:

    let y = f(x)
    let F(x) = ∫f(x)𝛿x
    𝛿y/𝛿x = (F(x + 𝛿x) + F(x - 𝛿x) - 2 * F(x))/((𝛿x) ^ 2)
At very least, it passed my polynomial sanity check:

    let F(x) = x^6
    𝛿y/𝛿x = 30(x^4) + 30(x^2)(𝛿x^2) + 2(𝛿x^4)
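The derived form is also easy to check numerically against a known second derivative (a minimal sketch; the function and step size are arbitrary):

```python
def second_symmetric(F, x, h):
    # (F(x + h) + F(x - h) - 2·F(x)) / h²
    return (F(x + h) + F(x - h) - 2 * F(x)) / h**2

F = lambda x: x**6               # F''(x) = 30·x⁴
est = second_symmetric(F, 1.0, 1e-4)
print(est)                       # ≈ 30, matching the 30(x^4) leading term at x=1
```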


That's awesome. Glad to see your method worked. If you ever have a chance to encounter a serious mathematician, they can probably give you the proper name of it. Or perhaps we'll have to name it the Yarg integral :-)


I've just been thinking of it as differentiation via integration.

It took me until the end to realise that all I'd done was derive a form for the second derivative of F(x).


Ah yes, this is why epsilon squared equals 0 in the dual numbers -- to automatically remove all higher order derivatives when they appear. Exponents on derivatives are indicative of their rank/order.


It's the second symmetric derivative.

https://en.wikipedia.org/wiki/Symmetric_derivative#The_secon...

https://en.wikipedia.org/wiki/Second_derivative#Limit

> The limit is called the second symmetric derivative. Note that the second symmetric derivative may exist even when the (usual) second derivative does not.

> This limit can be viewed as a continuous version of the second difference for sequences.

So it actually can do what I intended it to do, nothing original - but there was a degree of satisfaction in deriving it myself.


Thanks yarg, I just learned something new.


Is h used instead of 𝛿 or Δ because it is applicable to both partial and ordinary differentials? 𝛿 is always a partial symbol to me.

The d in dx, though, is not the same as h; it now stands on its own as an "operator".


No, the h here is a number, not an operator.


I just find that it buries the lede.


I find differentiable programming languages really fascinating. Think about this: a differentiable programming language is still a programming language. If the language is designed to facilitate a smooth optimization landscape, it's actually possible to "learn" programs with gradient descent. This opens the door to a lot of cool possibilities:

- programming languages which use neural networks as primitive functions (think `result = sum([mlp(input) for input in list])`. NN's are (understandably) notoriously bad at learning simple operators [1]. Differentiable programming over a language defined by aggregation functions (map/fold/sum/mean/etc.) allows us to bypass learning some simple functions.

- Flipping this around, we can use neural networks that use differentiable programs to regularize the outputs. Assume we have a NN that learns the speed of a car from a video. We know that a car's speed cannot exceed (say) 200mph. Make a differentiable program to express this and use it to regularize the output of the network.

- Reusing the image->NN->speed example again, use the differentiable program to identify speeds/conditions where using a neural network policy is unsafe and switch to a (less-performant) handmade policy instead.

Some more thoughts about this: https://atharvas.prose.sh/differentiable_dsls

[1] https://dselsam.github.io/posts/2018-09-16-neural-networks-o...
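The speed-regularization idea in the second bullet might be sketched as a soft penalty added to the loss (everything here beyond the 200mph bound is illustrative):

```python
def speed_penalty(speed_mph, limit=200.0):
    # zero below the limit, growing smoothly above it: a squared hinge,
    # which is once-differentiable everywhere (including at the limit)
    excess = max(0.0, speed_mph - limit)
    return excess ** 2

print(speed_penalty(180.0), speed_penalty(210.0))   # 0.0, 100.0
```

Adding a term like this to the training loss pushes the network's predictions back toward the physically plausible range without hard-clipping its outputs.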


Two points,

(1) everything which makes programs useful is impure device access and state change, discretely sequenced over time

(2) grad. desc. et al. do not learn discrete constraints (hence why NNs are bad at learning operators: they can't. x+x is defined for all x; not just for the x in the training set).


> everything which makes programs useful is impure device access and state change, discretely sequenced over time

I haven't heard about this before actually. I'd love to hear more about this! "Impure," here, is PL terminology for functions that affect global state/arguments when you run them, right? So, brainstorming a bit, what this means is that making a diff. programming language that treats a NN module as a pure function won't actually be beneficial? I'm not sure if I'm drawing the correct conclusion but this is a really interesting point. Don't have an answer for this (yet!).

> grad. desc. et al. do not learn discrete constraints

Great point! To push back a little on this: you're right that any discrete constraint will always mess up the smoothness of the function (e.g. less-than-g is not smooth at x=g). However, we can engineer our way around this by relaxing a discrete constraint to its closest smooth approximation! So, we can implement the less-than-g function as a sigmoid that is shifted by +/-g. This introduces a parameter to control the slope of the sigmoid. In practice, I haven't had much difficulty learning programs even with a really steep slope for the sigmoid.
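A minimal sketch of that relaxation (the threshold g and the slope value are illustrative):

```python
import math

def less_than_hard(x, g):
    return 1.0 if x < g else 0.0           # discrete: not differentiable at g

def less_than_soft(x, g, slope=50.0):
    # shifted sigmoid: smooth everywhere, approaches the hard step as slope grows
    return 1.0 / (1.0 + math.exp(slope * (x - g)))

g = 1.0
print(less_than_soft(0.5, g), less_than_soft(1.5, g))  # ≈ 1 and ≈ 0
```

Gradients flow through the soft version everywhere, so an optimizer can learn parameters on either side of the threshold.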


(1) Yes, the modern ML/AI lot seem to ambiguously use a purely mathematical meaning to "computer" -- which is useless. As useless as any pure mathematics. If we only had this a "computer" would be a theoretical curiosity, like a 200-dim sphere.

The real-world computers we care about run algorithms whose semantics is given by the properties of the devices real computers use. This double meaning to "computer" has caused a lot of superstition in the ML/AI space.

Real computers are engineering devices which shuffle electrical signals around to useful devices.

There is no reason to think that "pure algorithms" have any use at all, as with, eg., a 200-dim sphere. They're only useful if they can be given a semantics which exploits useful properties of devices. (cf. with physics, where a 200-dim sphere could be useful if it models some actual system).

(2) This isn't enough. Consider learning the rules of chess; or likewise, the inference rules of mathematics. f(x) = 2x^2, f'(x) = 4x, etc.

Search spaces constructed for a grad. desc. search are vast (indeed infinite), and the solutions we need are infinitely precise. Discrete approaches to search(ing for solutions) are necessary.


> As useless as any pure mathematics

Are you hearing yourself talk? Do you know why you have (to just name one example out of many) thousands of pictures on your phone, and not just a few? Because of pure mathematics. Because of compression: Even JPEG2000 from back in the day uses intricate and beautiful compression algorithms based on wavelets.


I think this is partially true; there is some support for logical statements and control flow in differentiable programs -- at least in JAX. Further, I think DeepMind have a recent paper on a DL sequence-learning methodology able to learn control strategies for lots of games simultaneously. I think this is a good example of learning discrete constraints with an NN.


Two counter-points (appreciate some counter-counter-points :):

1) The discrete sequencing is an epiphenomenon. The underlying processes are continuous changes in voltage and current flows. (I'm not sure if Planck scale considerations can throw a wrench in this though. Would love to be educated here.)

2) Our brains do not have ostensibly discrete neural processors. I don't think gradient descent is comparable to how the brain learns, but I think there is some reason to think that it is possible to learn symbolic processes in spite of having a processor that isn't especially built for it.


You're making a genetic fallacy here: that since the origin/ground of something has property C, its product must have it too. This isn't so.

I agree that reality is fundamentally continuous. However cognition isn't, and many things aren't.

A frequency is discrete. A length is continuous. These properties aren't eliminable for one another.

Here, whilst I'd agree that all physical processes going on (everywhere) have essential continuous properties, they also have essential discrete ones. The issue is that grad. desc. alone does not give you the right kind of discrete ones.


Thank you for sharing, this was a great article!

I had to implement autodiff from scratch for Nx (https://github.com/elixir-nx/nx) about 18 months ago and this would have helped me so much! I spent roughly a month chasing a bug because I was not handling the case of duplicated incoming nodes and you outline it very clearly. I will be recommending the article from now on to anyone who needs to understand how autodiff works.


A pet peeve of mine is that differentiable programming is co-opted almost entirely by deep learning + neural networks. The idea of differentiable programming is much bigger than SGD, and in fact neural networks are typically a simple program to differentiate. Full differentiable programming requires solving much more involved problems around control flow than just implementing numerical forward/reverse mode for math operations with well defined and understood gradients.


Nice! I asked something about this a while ago:

https://news.ycombinator.com/item?id=31044118


Thanks for the link. Interesting conversations and anecdotes.

Don’t worry about the fact that you don’t understand anything. Just keep trying to get results. If you focus on getting results, the understanding will come to you naturally. It’s how I learned ML.

Watching presentations is nice, but tinkering with working code is so much nicer. It’s no surprise you came away feeling like you don’t actually know what you thought you knew — the only way to understand it is to go get some experience.

Also don’t worry that it takes a long time to understand something. This stuff is inherently hard. (My answer to your post is “Yes, of course I feel that way too! I don’t understand a damn thing in most presentations till I go code it myself.” So don’t feel alone!)


Thanks


You don't need induction for (x + x'ε)^(n+1). The binomial formula can be applied once the arithmetic on dual numbers is introduced.
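Spelled out, the binomial argument is short: since ε² = 0, every binomial term with ε to a power of two or higher vanishes,

```latex
(x + x'\varepsilon)^{n+1}
  = \sum_{k=0}^{n+1} \binom{n+1}{k}\, x^{\,n+1-k}\, (x'\varepsilon)^{k}
  = x^{n+1} + (n+1)\, x^{n} x'\,\varepsilon ,
```

because ε^k = 0 for k ≥ 2, which recovers the power rule directly, no induction needed.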

> We can use this result to prove the same property for any smooth function f. Examining the Taylor expansion of f at zero (also known as its Maclaurin series):

That's not exactly true. For example, arctan = tan^-1 is smooth on R, but its Taylor series at 0 only converges for |x| <= 1. However, polynomial series expansions exist for |x| > 1, where similar arguments would apply. Also, Taylor series and Maclaurin series aren't the same thing. A Maclaurin series is a Taylor series centered at 0.

Lastly, is "differential programming" a new term? It seems weird to me that the machine learning community continues to reinvent terminology for already existing things. What is wrong with automatic differentiation? (Not necessarily a question for the author.) It's not really a new paradigm in the sense of "logic programming", "object-oriented programming", "functional programming", etc., and is instead just a technique.


The idea behind the term "differential programming" is that many traditional autodiff systems force you into a highly restricted subset of the language or DSL (e.g. no loops, no if statements etc). Differential programming is a term used to describe systems that let you take derivatives of your entire source code. This lets you do things like AD through a simulation or optimization algorithm which can be really powerful for applying ML techniques to domains that are less typically suited to ML approaches (circuits, physics, differential equations etc)
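As an illustration of that claim, here is a hedged Python sketch (the names and the toy ODE are hypothetical, not from any particular system) of forward-mode AD with dual numbers flowing through an ordinary simulation loop, i.e. differentiating "through" a simulation:

```python
from dataclasses import dataclass

@dataclass
class Dual:
    real: float
    eps: float  # coefficient of the infinitesimal part

    def __add__(self, o):
        return Dual(self.real + o.real, self.eps + o.eps)

    def __mul__(self, o):
        # Product rule: (a + a'e)(b + b'e) = ab + (ab' + a'b)e
        return Dual(self.real * o.real,
                    self.real * o.eps + self.eps * o.real)

def simulate(k: Dual, steps: int = 1000, dt: float = 0.001) -> Dual:
    """Euler-integrate x' = -k*x with x(0) = 1, using a plain loop."""
    x = Dual(1.0, 0.0)
    minus_dt = Dual(-dt, 0.0)
    for _ in range(steps):
        x = x + minus_dt * k * x   # ordinary control flow, no DSL
    return x

# Seeding k = 1 with eps = 1 yields dx(1)/dk alongside x(1).
# Analytically x(1) = e^{-k}, so both should be near +/- 0.3679.
out = simulate(Dual(1.0, 1.0))
print(out.real, out.eps)
```

The derivative of the whole simulation with respect to the parameter k comes out of the loop with no special handling of the control flow, which is the property the comment is pointing at.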


But I would wager that those are poor implementations of automatic differentiation, as the property that automatic differentiation works with for loops, if statements, etc. is inherent to automatic differentiation. So differential programming seems like automatic differentiation just implemented properly.

I've written some simple forward-mode automatic differentiation implementations in a few languages, and it's akin to just using a library. It doesn't seem like a paradigm to me.

Reverse-mode automatic differentiation is basically backpropagation. So again, ML seems to just like to rename things. I studied mathematics and have recently been trying to learn some ML. It's pretty annoying that a lot of mathematical terms seem to have been renamed or coopted for something else.


Autodiff does not work with for loops or if statements. The current solutions effectively pick a few promising traces through the program and then assume that nothing else exists. To handle it more elegantly (for things like preserving equational reasoning or avoiding exponential blowup) you need to address it at the level of language semantics.


> Autodiff does not work with for loops or if statements.

Is that necessarily true? Here is an incomplete automatic differentiation implementation that handles if statements just fine in a function definition. Unless you mean something else.

    type Dual = {Real: float; Epsilon: float} with
        static member (~-) (x: Dual) = {Real = -x.Real; Epsilon = -x.Epsilon}

        static member (+) (x: Dual, y: Dual) = { Real = x.Real + y.Real
                                                 Epsilon = x.Epsilon + y.Epsilon }
        static member (+) (x: Dual, c: float) = {x with Real = x.Real + c}
        static member (+) (c: float, y: Dual) = {y with Real = c + y.Real}

        static member (-) (x: Dual, y: Dual) = x + (-y)
        static member (-) (x: Dual, c: float) = x + (-c)
        static member (-) (c: float, y: Dual) = c + (-y)

        static member (*) (x: Dual, y: Dual) = { Real = x.Real * y.Real
                                                 Epsilon = x.Real * y.Epsilon + x.Epsilon * y.Real }
        static member (*) (c: float, y: Dual) = {Real = c; Epsilon = 0.0} * y
        static member (*) (x: Dual, c: float) = x * {Real = c; Epsilon = 0.0}

    let dcos (x: Dual) = {Real = cos x.Real; Epsilon = -(sin x.Real) * x.Epsilon}
    let dsin (x: Dual) = {Real = sin x.Real; Epsilon = (cos x.Real) * x.Epsilon}

    let differentiate (f: Dual -> Dual) a =
        let x = f {Real = a; Epsilon = 1.0}
        x.Epsilon

    let testFunction (x: Dual) = if x.Real < 0.0 then dcos x else dsin (x*x - 3.0*x)
Using that gives:

    > differentiate testFunction (-1.0)
      0.8414709848078965

    > differentiate testFunction 1.0
      0.4161468365471424

    > differentiate testFunction 0.0
      -3.0
Now, of course, one needs to be careful interpreting the result at a = 0.0. That's because the testFunction is not differentiable at that point due to a jump discontinuity there, but we still get a value back. But as far as I know, this is simply an issue with automatic differentiation in that it only correctly tells you what the derivative is if it exists at the given point.


This "discretizes then differentiates" to borrow terminology from [1] which is one of the more accessible presentations and papers. The program might evaluate correctly, but equational reasoning (like you might want for any kind of automated optimizations) is broken. In a toy example like this where you're doing everything manually then you probably don't care, but for larger systems, it gets tiring to do the mental equivalent of assembly programming.

[1] https://people.csail.mit.edu/sbangaru/projects/teg-2021/


This isn't a toy example, though. It's the start of a library. Once you've developed the dual numbers and differentiate function and defined the dual number versions of all elementary functions, then you have a full (forward-mode) automatic differentiation library that can just be used. You wouldn't have to do anything manually. You'd just define your functions using this library instead of the built-in functions, since you can use the dual number functions both to differentiate or simply to evaluate (setting the dual part to 0).

> This "discretizes then differentiates"

Not sure what you mean. It defines dual numbers, then defines elementary functions on dual numbers (I only did two as an example). From there, you get differentiation for free (i.e., automatically). The only thing that was done manually was defining the testFunction. Everything else would be part of a library that you'd consume.

I'm not sure what you mean by "equational reasoning is broken".

Thank you for the link to the paper. Seems interesting, and I'll read through it more. Although, it is discussing differentiating integrals, which is where their language "discretize-then-differentiate" comes from. From this paper, I sort of get a sense of why differentiable programming might make sense as a concept, but I've only ever seen the term introduced with automatic differentiation, which is what I was balking at (given the content of the original post here). I'll keep reading this paper, but I think what you've mentioned before hasn't convinced me. Thanks for the discussion.


> In a toy example like this where you're doing everything manually then you probably don't care, but for larger systems, it gets tiring to do the mental equivalent of assembly programming.

But by this argument (which sounds plausible to me) you have defeated your previous claim that differential programming is really a new paradigm, as it seems you have adopted what bmitc wrote earlier, that differential programming is not a new paradigm but "seems like automatic differentiation just implemented properly".


There's no contradiction: autodiff is a method of implementing differentiable programming. In this example, it is implemented as a type that handles a trace of a program, but everything else is left to the programmer. This is a problem because most of the code I would want to write is not a single trace!

Analogously, I could write a program in C that does message sends and organizes code in a design pattern called "objects" and "classes". Incredibly painful, but workable sometimes. Some people even call it "object oriented C" and go on to create a library to handle it like [1]. Is object orientation not a paradigm because I've implemented a core piece as a library?

No, because that misses the intangible part of what makes a paradigm a paradigm: I structured my code this way, for a reason. In OOP, that reason is the compartmentalization of concerns. The underlying OOP mechanism gives me a way to reason about composition and substitution of components to minimize how much I have to reason about when writing code. Similarly, in differentiable programming, the differentiability of all things gives me a way to reason about the smooth substitution of things because it more easily lets me reason about how the machine writes code.

[1] https://en.wikipedia.org/wiki/GObject


Seems we're arguing about definitions. Currently differentiable programming seems to be this vaguely defined term (I don't get what you mean by smooth substitution), with autodiff being its only (proper) instantiation.

You say autodiff is actually not representative of differentiable programming. But if there aren't any other good examples that illustrate differentiable programming, how is differentiable programming (currently) more than autodiff?...


@bmitc: Reading your replies (some of which seem to have been written the same time I wrote mine), it seems we are on the same page; I'm also a mathematician and I also have some qualms with how people invent new names for automatic differentiation :) I had a look at your bio and couldn't find any email address. Would you perhaps be interested in having a longer, scientific discussion about AD?


Julia's AD is compatible with control flow. They have their own issues, but Zygote + ChainRules actually work pretty well


Mathematician here.

> Differential programming is a term used to describe systems that let you take derivatives of your entire source code.

I think the way this is stated is incorrect. What would the derivative of a program that outputs the reverse of the input string be? It would be meaningless.

It seems rather to be the case that it describes systems that let you take derivatives of your entire source code where that source code describes a numerical function. Perhaps it is obvious to state this, but I keep seeing differentiable programming being described as "taking a derivative of the source code" and it seems annoyingly imprecise.

> This lets you do things like AD through a simulation or optimization algorithm

This is also, taken literally, not a correct statement. It's not the algorithm which you want to differentiate, but a hard-to-differentiate function within the algorithm (which is always a gradient-descent type algorithm, as it otherwise makes no sense to consider derivatives; there is the entire domain of derivative-free optimization BTW). So, in case of (stochastic) gradient descent to train a neural network, that would be the neural network that you want to use AD on, not the gradient descent algorithm. Obviously.


> So, in case of (stochastic) gradient descent to train a neural network, that would be the neural network that you want to use AD on, not the gradient descent algorithm. Obviously.

The specific proposal of differential programming as a paradigm that goes beyond simple applications of gradient descent for optimising NNs is exactly to apply gradient descent to optimise learning (and other) algorithms themselves. This may be to select hyperparameters, or to select among a family of algorithms, or to include differentiable constraints in the solver, etc.

I wonder, are you familiar with now classic paper, Learning to learn by gradient descent by gradient descent?

[0] https://arxiv.org/abs/1606.04474


I don't really see the distinction. In the basic case of using gradients to optimize a NN, what's really being optimized are the parameters of a function expressing the NN error. We're just following the gradient downhill to find whatever minimum of this function the gradient leads us to (hopefully the global minimum, as is likely to be the case with lots of parameters).

In this "learning to learn" paper the optimization function (used to update the gradients at each step of the gradient descent algorithm) has been parameterized, and we're instead using gradient descent to find a minimum of that parameterized function.

So, yes, differentiable programming isn't limited to minimizing the losses of neural nets, but as the grandparent poster pointed out, the whole concept obviously only applies to parameterized programs calculating some numerical output that we want to minimize.


Wow, it’s interesting that a paper from 2016 is already classic. This field moves fast!


Well, I'd say it's a "classic" within the field of differentiable programming, which is a new field, so... everything is relative.


Agreed about the binomial formula, but don't you need induction to prove the binomial theorem? That is, if you have some sort of set that's like the naturals except that Peano's axiom of induction doesn't apply (such as the naturals plus Alberto and Cristina, who are one another's successors) is the binomial theorem necessarily true in it?

Agreed about Maclaurin series.

I didn't know there existed smooth functions with divergent Taylor series. I don't understand how that happens; if we're looking at term $n$ of the Taylor series around some point $a$, it's $f^{(n)}(a) \frac{(x - a)^n}{n!}$, where $f^{(n)}$ is the nth derivative, right? I guess you could have a function whose derivatives eventually start increasing so fast that even if (x - a) is 10^{-100}, the exploding derivative eventually wins?

PARI/GP helpfully informs me that deriv(atan(x)) is 1 - x^2 + x^4 - x^6 + x^8 - x^10 + x^12 - x^14 + O(x^16), which sure sounds like the sort of thing that would diverge when |x| > 1. But presumably that's based on the Maclaurin series for arctan, and you'd get a different result if you were taking a Taylor series around 0.9 or something? In response to taylor(atan(x-9/10), x) PARI/GP says things I will not repeat here, so I guess blindly pounding on the keyboard isn't going to get me very far. Some guidance could be helpful.
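For what it's worth, the textbook example of a smooth function that is not given by its Taylor series is simpler than exploding derivatives: all the derivatives can vanish at the center. With

```latex
f(x) = \begin{cases} e^{-1/x^{2}} & x \neq 0 \\ 0 & x = 0 \end{cases},
\qquad f^{(n)}(0) = 0 \ \text{for every } n,
```

the Maclaurin series is identically zero, so it converges everywhere but equals f only at x = 0. arctan, by contrast, is analytic on all of R: its Maclaurin series diverges for |x| > 1, but its Taylor series around a point like 0.9 has its own positive radius of convergence.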

I think differentiable programming is a new paradigm in the sense of logic programming and functional programming, and I do think it goes beyond just autodiff (though the linked article only explains autodiff). To the extent that a programming system is differentiable, you can use gradient descent to search for programs that minimize a loss function, not just inputs that do. But even just searching for inputs that minimize a loss function is a pretty different way to program than the conventional approach, even though Ivan Sutherland proposed it as a general approach in SKETCHPAD.

An example of a differentiable programming system is in https://arxiv.org/abs/1605.06640 "Programming with a Differentiable Forth Interpreter", 02016.


> Agreed about the binomial formula, but don't you need induction to prove the binomial theorem?

Yes, that's the point. :) In that, they're basically re-proving the binomial theorem (or more accurately, using the same method) rather than just using its results with the new dual number arithmetic.

Michael Spivak's Calculus has some great chapters on Taylor polynomials, Taylor's remainder theorem, and Taylor series. I've been looking into this stuff lately, to try and justify mathematically why automatic differentiation works (I've never read anything that does so, including several published papers), and that's where I was reminded of the fact that not everything is gold with Taylor's polynomials on real numbers. Maybe (?) the introduction of dual numbers gives something akin to complex analysis' concept of analytic. Not sure and not there yet.

> I think differentiable programming is a new paradigm in the sense of logic programming and functional programming, and I do think it goes beyond just autodiff (though the linked article only explains autodiff).

After some of the responses, maybe that's where I'm balking at. In that, I've only ever seen differentiable programming mentioned in the context of automatic differentiation. One of the papers posted led me to be more comfortable with the concept of differentiable programming as a distinct thing, in a sort of Mathematica sense, but I'll definitely need to read some more.


> I've been looking into this stuff lately, to try and justify mathematically why automatic differentiation works.

What do you mean by that? Isn't the theory (at least the mathematical, abstract version of it) already fully represented in books like Griewank that I mentioned in another comment here in this post?

I remember seeing somewhere a rather well-developed theory using dual numbers (but this was a purely mathematical development); but I think that there is a bit of a division between computer scientists and mathematicians, so it could be that the whole machinery is mathematically already developed and is now being re-developed under the mantle of theoretical computer science, given new clothing in terms of a programming language/semantics that supports the mathematics "natively". But that is more of a guess on my part.

> One of the papers posted led me to be more comfortable with the concept of differentiable programming as a distinct thing.

Which paper was that?


> Isn't the theory (at least the mathematical, abstract version of it) already fully represented in books like Griewank that I mentioned in another comment here in this post?

I guess it's in there somewhere, but like many of the papers, I find that book overcomplicates things, and it's just not to my liking. I also don't see any mention of dual numbers, although it's probably somewhat implicit in the development (maybe). I'm mainly interested in the dual number approach. With the dual number approach and forward-mode automatic differentiation, there's really no reason to overcomplicate the development by discussing "propagating tangents". The development can be done much more intuitively because, using dual numbers, automatic differentiation just falls out without discussion of propagation.


I think autodiff is mostly interesting because of optimization by gradient descent, and gradient descent is only efficient using reverse-mode autodiff, which can't be done with dual numbers.

I think optimization by gradient descent (and its variants like Adam) is interesting because it's the only optimization algorithm family I know of whose runtime scales linearly or even near-linearly with the number of dimensions, so for large-dimensionality problems it's the only computationally feasible algorithm. Also, though it's only guaranteed to find the optimum for differentiable convex problems, in practice it seems to work well enough to be interesting for nonconvex problems.

I think optimization is interesting because it allows you to solve arbitrary inverse problems, or at least continuous relaxations of them.


Yes, the book is hard to read. I just went through a few chapters of it.

> One of the papers posted led me to be more comfortable with the concept of differentiable programming as a distinct thing.

I'd still be interested in knowing which of the papers that was ;)


Rather than repeating various arguments, I think it's best to just mention this link, which provides explanations and illuminating examples of the phenomenon of smooth functions with diverging Taylor series: https://mathoverflow.net/questions/72/whats-an-example-of-a-...

Quote by Pietro Majer from the link: "it is more useful to communicate the (historically non-trivial) idea that there is a difference between a function and a representation of it by means of a formula"


Thanks! I wonder why the example of a smooth function with a diverging Taylor series here is so much more complicated than arctan.


Object oriented programming, for example, doesn't let me have a variable hold half of one object and half of another or let the language derive the code that gave me that object at runtime, but object oriented + differentiable programming does. It's no less of a paradigm than logic, quantum, or probabilistic programming. If you want to, you can view differentiable programming as extending logic programming with a product and chain rule (+ some additional constraints) that allows (smooth, if you want it) interpolation between data and code.

That said, most discussion of differentiable programming is at the level of syntax sugar for reverse mode differentiation, so I can't blame you for that conclusion.


You don't need OOP plus another paradigm to do automatic differentiation though. I've implemented automatic differentiation, albeit "simple" versions and only forward-mode at this point, but there's really nothing special about the implementations.

Logic programming, on the other hand and for example, needs something much more substantial to be implemented as a library in an existing language, such as backtracking, unification, or the full-on Warren Abstract Machine.

If someone has a clear example of differential programming that is different than just using automatic differentiation as a technique or library, then that might help.

> doesn't let me have a variable hold half of one object and half of another or let the language derive the code that gave me that object at runtime

I'm not sure what you mean here. Could you elaborate?


I don't need OOP to do a hash table lookup and then an indirect function call with the receiver as the first argument either but that ignores that there's more to a paradigm than the algorithm I use for facilitating it. You can embed unification of expression trees as a library in C++. People implement backtracking all the time in almost every language. Talking about differentiable programming as if it's just autodiff is missing the point of what a programming paradigm is.

There's a mechanism, yes, but that's just a means to an end of efficiently enabling a different way of approaching programming. In the case of differentiable programming, that's continuous code and continuous data enabling program search that doesn't have to use purely discrete methods (like logic programming). If that sounds like autodiff and backprop, then yes, that's because that's a good way to implement it. Tensorflow and PyTorch are DSLs embedded in Python and C++ both useable and used for more than just implementing neural networks, but most people aren't happy calling a library a language until it has a parser and a file extension.

> I'm not sure what you mean here. Could you elaborate?

Most programming languages assume that a variable can only contain one value, or a composite value of values. Differentiable programming lets code be smoothly transformed from one to the other while being meaningful at all points between. In an object oriented case, this would be like having a variable contain an object that behaves like some known object A or object B selectively depending on which choice maximizes the success of the program at any given moment.
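A tiny Python sketch of that idea (all names here are made up for illustration): a differentiable gate blends two behaviors, and a gradient method could push the gate parameter toward either extreme:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def blended(theta, x, behavior_a, behavior_b):
    """Soft selection: w = sigmoid(theta) interpolates behaviors A and B.

    theta is a continuous parameter that gradient descent could adjust;
    at the extremes the blend collapses to picking A or B outright,
    and every point in between is still a meaningful program.
    """
    w = sigmoid(theta)
    return w * behavior_a(x) + (1.0 - w) * behavior_b(x)

square = lambda x: x * x
double = lambda x: 2.0 * x

print(blended(-20.0, 3.0, square, double))  # ~6.0: effectively "double"
print(blended(20.0, 3.0, square, double))   # ~9.0: effectively "square"
print(blended(0.0, 3.0, square, double))    # 7.5: half of each
```

The variable really does "hold half of one behavior and half of another" at theta = 0, which a conventional object reference cannot express.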


Seems a bit of conflation of notions going on here: Just because logic/quantum/probabilistic all end with "programming", it doesn't mean that they all are what is called a "programming paradigm" in the typical sense [0]. Where does a paradigm end and a library implementation begin?

For example, varying how a program arrives at the answer induces imperative/OOP vs. declarative paradigms [0]. On the other hand, quantum programming assumes a radically different type of "CPU" (i.e. instruction set) on which your program runs - which in turn obviously changes everything. But this can safely be implemented within existing paradigms, e.g. ProjectQ [1] is implemented in an OOP language: Python. Thus, I would not call this a new programming paradigm (unfortunately [0] does that, but I think that is bad form).

> you can view differentiable programming as extending logic programming [...] that allows interpolation between data and code.

Reference? I have browsed one of the standard references [2] on automatic differentiation and googled a bit and could not find something that supports your statement. Even more so, it seems that even defining semantics for differential programming is barely in its starting stages [3].

> differentiable programming is at the level of syntax sugar for reverse mode differentiation, so I can't blame you for that conclusion.

By the same argument you could say that probabilistic programming is just syntax sugar for painless specification of statistical models.

Care to provide a reference where differential programming is presented as something more than "syntax sugar"?

[0] https://en.wikipedia.org/wiki/Programming_paradigm#Further_p...

[1] https://projectq.ch/

[2] Griewank A., Walther A., Evaluating Derivatives, SIAM 2008

[3] https://arxiv.org/pdf/1911.04523.pdf


The existence of a different kind of CPU isn't a meaningful distinction at the level of discussing paradigms. The semantics are different, so the abstract machine is different. The fact that I need a different set of atoms in my desktop to use it doesn't change the programming language part of the discussion.

The main paper to read is [1] which introduces a syntactic notion of differentiation in the lambda calculus connecting substitution and nondeterministic choice to differentiation in the calculus of infinitesimals sense and also introduces a meaningful notion of Taylor expansion of arbitrary programs. This paper is mostly of academic interest, though. The resulting expansion is wildly uncomputable meaning that more modern, practical papers like [2] cite it wistfully as a dream of what could be achieved. How to computably handle most of the constructs we care about in a general programming sense is very active, open research. At the time the paper was introduced, it was more influential on (and influenced by) work on probabilistic and quantum programming through their related models of linear logic [3]. There are only a few slight axiom differences that separate differential, logic, probabilistic, and quantum programming though, so if you're willing to accept one as a "paradigm", then you should accept the others.

[1] https://www.sciencedirect.com/science/article/pii/S030439750...

[2] https://arxiv.org/abs/1911.04523

[3] https://ncatlab.org/nlab/show/differential%20category


It seems you haven't read my references? As your [2] is my [3] from above!

> The existence of a different kind of CPU isn't a meaningful distinction at the level of discussing paradigms.

Well, that was my point above: You can't really lump quantum programming together with probabilistic programming, as they are paradigms on different "levels".

> practical papers like [2] cite it wistfully as a dream of what could be achieved

Are you sure about that? I skimmed [1] as I hadn't read it, and it seems to describe a rather restricted set of functions ("types are interpreted as vector spaces and terms as functions defined by power series on these spaces"), as there are many differentiable functions that cannot be defined as power series.

Moreover, in [2] it is only claimed: "Ehrhard and Regnier {i.e. your reference [1]} do not give an operational semantics but they do give rules for symbolic differentiation and it should not be too difficult to use them to give an operational semantics. However their language with its convenient vector space semantics only supports total functions. It therefore cannot be extended to include recursive function definitions or conditionals (even with total predicates, as continuous functions from R^n to the booleans are constant)." So I would not say they cite [1] as a wistful dream ...

> There are only a few slight axiom differences that separate differential, logic, probabilistic, and quantum programming though.

Give me the axioms and their differences and I believe you. :) (Honestly, I'm not even sure if the discussion on which axiomatization captures the existing developments has been settled; it seems to me you have some kind of category theoretic approach in mind where you just change the category and get a new paradigm - I'd be happy to accept this as well, if there is a clear reference, though I'm doubtful one exists ...)


I don't usually respond to old comments, so I don't know if you'll read this, but I hope I can encourage you to think more broadly about what "differentiable programming" means.

Different fields have a different perspective on the same set of tools because those tools have different pathological cases in different areas. Context really matters.

> Well, that was my point above: You can't really lump quantum programming together with probabilistic programming, as they are paradigms on different "levels"

This distinction is not useful. If I write in a functional or logic programming language, it gets translated into imperative commands for an underlying architecture that is some mix of dataflow, event driven, automata-based, concurrent, etc... that is then further built on top of some physical atoms where an engineer worried about quantum effects. If I write in a quantum programming language, it will probably go through the same process for at least another 5 years. You might argue that quantum is somehow more dictated by the underlying physical model the way that people argue that imperative programming is closer to the physical world than functional programming. But the "level" doesn't change the usefulness of viewing all of these as "paradigms" worthy of study and analysis on their own terms with their own tools. At the level of studying a programming language, the "level" is a useful thing to be aware of for implementations and motivation but usually not for a theory.

> Even more so, it seems that even defining semantics for differential programming is barely in its starting stages

This is also not a useful distinction. OOP was also, infamously, a point of contention between the academic and outside worlds because it was developed and incredibly prevalent without a rigorous theory abstracting it beyond procedural programming. It became a "paradigm" despite that because there was a set of (informal) tools for reasoning about it on its own terms [1].

Likewise, differentiable programming has largely developed to formalize what makes programs written for machine learning frameworks different from programs written in the imperative/object oriented/functional language they are built on. Autodiff has mostly developed in practical usage, so the use cases are front-running the theory. There's increasingly hardware tailored to the execution model and software developers attempting to program it. There are approaches to problems like discontinuities that people have found solutions for without a rigorous theory justifying their use. There's a structure to why and how people are writing code for these applications as well as an operational theory for how to reason about it, but there's very little compositional, equational theory for these choices.

To most people in machine learning, "differentiable programming" is just autodiff with pretty syntax, because the term sprang from attempts to put what they are already trying to accomplish with that implementation on more solid theoretical footing, as a computable model of a more general logic. That, hopefully, lets us more efficiently explore what a better domain-specific theory might be and whether there are better execution models or logical frameworks. Autodiff itself is increasingly used as an umbrella term for many other methods of differentiation with different edge cases, so this is already happening in practice.

To reduce "differentiable programming" to just its implementation ignores that aspect. It would be equivalent to equating machine learning to matrices. Not unreasonable computationally and not a terrible place to start for a theory, but deeply unsatisfying as a mature domain-specific theory.

The main paper I linked [2] is not about autodiff at all. It's an attempt to establish a connection between differentiation in an analysis sense to models of (not otherwise obviously differentiable) program evaluation. The (unrealized) promise is that the centuries of understanding we have for the calculus of infinitesimals can be applied to the less-mature study of lambda calculus and nondeterministic computation. Papers like [3] cite it because it addresses (discrete) structures that analysis is less interested in and potentially provides a way to connect computation, calculus, and whatever it is that we're doing with machine learning.

> Are you sure about that? I skimmed [1], as I hadn't read it, and it seems to describe a rather restricted set of functions ("types are interpreted as vector spaces and terms as functions defined by power series on these spaces"), as there are many differentiable functions that cannot be defined as power series.

Because it's a PL theory paper, it's not concerned with whether all differentiable functions can be represented, but whether all computable functions can be differentiated. And PL theorists are generally more comfortable than most to accept that most functions cannot be computed and choose a more restrictive model that enables more reasoning power. The category [4] is really the better place to start since it lets us also consider models that aren't vector spaces and [2] is best thought of as a prototype that left many gaps in the theory (for example the requirement on coefficients for convergence is wrong, though I can't remember which paper by Lionel Vaux proved this). It can be thought of as computable in finite instances, but it's unsound even when typed due to the zero term and result of sums.

The quote you cite from [3] is easily misunderstood without that context. As a practice-focused paper, it cares very much about computability. Conditionals and loops are possible in [2] since it allows church numerals and fixed point combinators but it introduces a nondeterministic sum which is exponential in the number of evaluation steps and may diverge (doubly so since it's the untyped lambda calculus...) and is difficult to operationalize. That's what I meant by "wildly uncomputable". So, to them, [2] offers a useful mental framework for higher order features, but is not practical. The theory isn't there yet.

The connections between logic, quantum, probabilistic, and differentiable programming can be understood by how the model treats the exponential modality (!) which converts the otherwise linear term to an analytic one. Differentiation decomposes this to give a sum of linear terms. Differential lambda calculus doesn't put any (more) structure on the sum. Probabilistic programming gives the added structure of a probabilistic sum where coefficients are weights. Quantum programming can be modeled via a Fock space [5] for (!) ([5] predates [2] so is not directly discussed as a differential category here). However, it's unclear what the right model for differentiable programming should be if we want something practical for the resulting derivative (much less antiderivative). Daniel Murfet et al [6] have some related work more directly in the context of machine learning.

[1] https://www.cs.cmu.edu/~aldrich/papers/objects-essay.pdf [2] https://www.sciencedirect.com/science/article/pii/S030439750... [3] https://arxiv.org/abs/1911.04523 [4] https://ncatlab.org/nlab/show/differential%20category [5] https://www.researchgate.net/publication/2351750_Fock_Space_... (sadly a researchgate link) [6] http://therisingsea.org


> don't know if you'll read this

I did read it ;) Because I'm very much interested in this entire topic.

> I hope I can encourage you to think more broadly about what "differentiable programming" means

I'm trying to, but I find it hard. My stance was that differentiable programming seemed like a theory for which only a single example (namely autodiff) existed, as you also said ("autodiff has mostly developed in practical usage, so the use cases are front-running the theory"). But this entire comment of yours really clarified some things for me.

> The main paper I linked [2] is not about autodiff at all.

> The quote you cite from [3] is easily misunderstood without that context

You finally convinced me to have a detailed look at this. Thank you for providing the context.

> Conditionals and loops are possible in [2] since it allows church numerals and fixed point combinators but it introduces a nondeterministic sum [...] and is difficult to operationalize. That's what I meant by "wildly uncomputable".

I think I may have misunderstood some of your previous comments (and perhaps vice versa), as it now dawns on me that you use a vocabulary that comes (I guess?) from PL theory and is very different from the one I'm used to as a mathematician versed in analysis. I'll re-read them.

> Daniel Murfet et al [6] have some related work more directly in the context of machine learning.

I'm actually aware of Daniel Murfet but haven't read his work from recent years. Did you have a specific paper of his in mind?


P.S. (In case you return once more to read comments.)

It seems like these are problems that could benefit from a PL-theory <-> mathematical analysis cross-collaboration.

I would have sent you a private message via email, but couldn't find any info on your HN profile; my email is there.


> That's not exactly true. For example, arctan = tan^-1 is smooth on R, but its Taylor series only converges for |x|<=1.

This issue creeps in a lot, but I think it comes from complex analysis, where a differentiable function is always smooth and analytic (holomorphic).
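A quick numeric illustration of the arctan point (the helper name `arctan_taylor` is just for this sketch): the Maclaurin series behaves completely differently inside and outside |x| <= 1, even though arctan itself is smooth everywhere on R.

```python
import math

def arctan_taylor(x, terms):
    # Partial sum of the Maclaurin series:
    # arctan(x) ~ sum_{k>=0} (-1)^k x^(2k+1) / (2k+1)
    return sum((-1) ** k * x ** (2 * k + 1) / (2 * k + 1) for k in range(terms))

# Inside the radius of convergence (|x| <= 1) the partial sums home in on arctan:
print(abs(arctan_taylor(0.5, 50) - math.atan(0.5)))  # negligibly small
# Outside it (|x| > 1) they blow up, even though arctan is smooth at x = 2:
print(abs(arctan_taylor(2.0, 50)))  # astronomically large
```

With 50 terms, the error at x = 0.5 is below machine precision, while at x = 2 the last term alone is on the order of 2^99/99.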


But as far as I know, there is no corresponding dual analysis. The dual numbers do not form a field.

The issue primarily revolves around elementary functions, compositions of them, and polynomials.
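For concreteness, a minimal sketch of the dual-number construction (the names `Dual` and `derivative` are illustrative, not from any library): it differentiates compositions of elementary operations exactly, and the very ε that makes it work is a zero divisor, which is why the duals cannot form a field.

```python
class Dual:
    """Minimal dual number a + b*eps with eps**2 == 0 (forward-mode AD)."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b  # value part, derivative part

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, since eps^2 = 0
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)
    __rmul__ = __mul__

def derivative(f, x):
    # Seed the derivative part with 1 and read off the eps coefficient.
    return f(Dual(x, 1.0)).b

# d/dx of x^3 + 2x at x = 2 is 3*4 + 2 = 14
print(derivative(lambda x: x * x * x + 2 * x, 2.0))  # 14.0

# eps = 0 + 1*eps is nonzero, yet eps * eps == 0: a zero divisor,
# so no multiplicative inverse exists and the duals are not a field.
eps = Dual(0.0, 1.0)
print((eps * eps).a, (eps * eps).b)  # 0.0 0.0
```

So you get exact derivatives for polynomials and (with the usual extensions for sin, exp, etc.) elementary compositions, but no field structure to build a full "dual analysis" on.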


Yeah, I was only trying to explain where the issue (assuming smooth = analytic, which you rightly pointed out) might have come from.


Ahh, gotcha. I'm trying to figure this out myself. I've always been intrigued by automatic differentiation but have never been satisfied by any article addressing its mathematical justification.


Neither have I ...



Related post:

Differentiable Programming – A Simple Introduction | 159 points, 3 months ago, 49 comments | https://news.ycombinator.com/item?id=31000709

There I mention "The simple essence of automatic differentiation", which posits that by keeping a function paired together with its derivative, many functional transformations -- including integration, subexpression deduplication, and even automatic conversion from forward mode to reverse mode -- would be greatly simplified. Allegedly this claim is not backed up by a practical demonstration yet, but imo this is a very intriguing approach.
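The pairing idea can be sketched in a few lines (simplified here to scalar derivatives, whereas the paper works with linear maps; the names are illustrative): each "differentiable function" returns both its value and its derivative, and composition threads the chain rule through automatically.

```python
# Represent a differentiable function as x -> (f(x), f'(x)),
# so composing functions also composes their derivatives.
def compose(dg, df):
    """Differentiable composition g . f with the chain rule built in."""
    def dgf(x):
        y, dy = df(x)         # y = f(x),  dy = f'(x)
        z, dz = dg(y)         # z = g(y),  dz = g'(y)
        return z, dz * dy     # (g . f)'(x) = g'(f(x)) * f'(x)
    return dgf

square = lambda x: (x * x, 2 * x)   # value and derivative of x^2
double = lambda x: (2 * x, 2)       # value and derivative of 2x

h = compose(square, double)         # h(x) = (2x)^2, so h'(x) = 8x
print(h(3))                         # (36, 24)
```

Because the derivative travels with the function, program transformations never have to re-derive it from the syntax of the body.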


This is the first time I've read about differentiable programming, and it was incredibly interesting. It's also the first time I've come across this blog, and it's wonderful. The minimal interactions really helped me visualise the content and aided comprehension. Great job.


Yes, very beautiful stuff. The only thing I don't like: the text in the SVGs is not TeXed, but that's difficult to achieve, I guess.


Nice, I hope Max Slater keeps writing about math, because this article seems more like it is written for a human than a lot of math texts are. And it doesn't skimp on complexity or use silly GIFs to achieve that.


I feel differentiable programming is not really there yet.

I understand the specific case of numerical functions: it may make sense if you just want to tweak some parameters a bit, as in a NN, or, for example, in quantum differentiable programming, where people seem to tweak parameters in some Hamiltonian.

What I actually want is something that will actually change the structure of the program somehow.

If you pretend the entire program is a function f, parameters are just points in your program. Why is moving to different points "differentiating"? Differentiating means producing a new function f'.

In the case of a NN it makes sense to select different points. But then we do also modify the structure, with dropout etc.
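For the "tweak some parameters" reading, the whole mechanism fits in a few lines: the program stays structurally fixed, and differentiation only tells you how to move a point in parameter space. This sketch uses a central finite difference in place of real autodiff; the names (`loss`, `grad`) and the toy model are illustrative.

```python
# Hypothetical program: predict y = w * x, penalize distance to a target.
def loss(w):
    x, target = 3.0, 6.0
    return (w * x - target) ** 2

def grad(f, w, h=1e-6):
    # Central difference standing in for autodiff.
    return (f(w + h) - f(w - h)) / (2 * h)

# Gradient descent moves the point w; the structure of loss never changes.
w = 0.0
for _ in range(100):
    w -= 0.1 * grad(loss, w)
print(round(w, 3))  # ≈ 2.0, the w for which w * 3 == 6
```

Structural changes like dropout sit outside this picture, which is exactly the gap the comment above is pointing at.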



