I thought it was known for a long time that any function can be represented with a neural network with a single hidden layer. It's an almost trivial finding if you think about it: imagine you have a steep step function that looks something like __/^^, with a non-zero derivative only in a small range, e.g. 0.000-0.001 (or ϵ if you like). Let's call this f. You can piece together any function from these tiny pieces as ∑ᵢ cᵢf(aᵢx+bᵢ)+dᵢ; you just need an infinite number of them. This sum is a neural network with a single hidden layer.
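A minimal numpy sketch of that construction, assuming an illustrative target (sin on [0, 2π]), 200 knots, and a ramp width of 1e-3: one steep ramp unit per knot, weighted by the local increment of the target, summed as a single hidden layer.

    import numpy as np

    # A steep step-like unit (__/^^): rises from 0 to 1 over a window of width eps.
    def f(z, eps=1e-3):
        return np.clip(z / eps, 0.0, 1.0)

    target = np.sin                          # illustrative target function
    knots = np.linspace(0, 2 * np.pi, 200)   # one hidden unit per knot

    c = np.diff(target(knots))               # c_i: jump of the target between knots
    b = knots[:-1]                           # b_i: shift of each unit
    d = target(knots[0])                     # output bias

    def shallow_net(x):
        # sum_i c_i * f(x - b_i) + d, evaluated for all units at once
        return d + f(x[:, None] - b[None, :]) @ c

    x = np.linspace(0, 2 * np.pi, 1000)
    print(np.max(np.abs(shallow_net(x) - target(x))))  # ~0.03; shrinks as knots increase

The max error is on the order of the knot spacing, which is the hand-wavy version of the argument: add more units and it goes to zero (for a continuous target, at least).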
> Based on this proof, we provide an algorithm that, given a deep ReLU network, finds the explicit weights of the corresponding shallow network.
I’m out of date on the research, but I suspect the real value here is the algorithm. Sounds like some version of this could eventually help reduce inference time
I haven't read the paper, but most of the time, in order to achieve matching outputs with fewer layers, you will need exponentially more neurons per layer.
If you have infinite parallelism I believe the shallow network would be faster, but deep networks will use fewer total operations and will be faster in practice.
This is how it is for logic design using gates, and I believe that's not a coincidence.
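A back-of-envelope version of that trade-off, with made-up sizes (depth, widths, and the shallow width W are purely illustrative):

    # Illustrative numbers only: a depth-D, width-w deep net vs. a depth-3 net of
    # width W computing a comparable function, with n inputs and m outputs.
    D, w, n, m = 20, 256, 256, 256
    W = 2**18                          # assumed (much larger) shallow hidden width

    deep_macs = D * w * w              # rough total multiply-accumulates
    shallow_macs = n * W + W * m       # one very wide hidden layer, in and out

    print(deep_macs, shallow_macs)     # deep does ~100x less total work here...
    print(D, 3)                        # ...but needs ~7x more sequential steps

So with enough parallel hardware the depth-3 net finishes in fewer sequential steps, while the deep net wins on total operations (and hence on realistic hardware).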
Finding the right optimum between parallel and deep for logic design is computationally intensive itself, hence it often comes down to experiential learning. The latter implies potential use of machine learning for that optimization. And it continues ... lived happily thereafter. :-)
There may be some utility to optimizing a given neural network to exploit the features of a given computational architecture. The point would be to expand the width of each layer to hit the limits of your hardware's parallelism. I.e., partition the network into n parts and shallow-ize each one to produce a new neural network with n layers, each of which is roughly as wide as your p parallel computational channels.
> Corollary 1. Given a ReLU network N : R^n → R^m, the shallow network S : R^n → R^m of depth L = 3, as specified in the proof of Theorem 1, has width bounded by max{2n + k, 2n + p, 2 · p · m}
I think p might be the number of layers, but I'm not sure, and I'm not sure about k at all.
So they might be claiming a pretty tight bound on the weights.
They (suspiciously, but not necessarily intentionally) avoid calling attention to it, but:
> Let ({fω}ω∈Ω,Ω) be the decomposition that corresponds to N. Label the parts 1,...,p.
p is the number of regions the deep network divides R^n into, and is in general exponential in the network depth. (E.g. consider R^1 = [-1,+1], L[k](x) = 2*abs(x)-1 ∀k (where abs(x) = x + 2*ReLU(-x), IIRC), which divides R^1 into 2^D equal segments using only D individual ReLUs, assuming I wrote the formulas down correctly.)
So for an arbitrary deep network the shallow network size is definitely exponential in the network depth, and while it's not clear whether that's true for typical 'trained' deep networks, I would be surprised if it wasn't.
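A quick numerical check of the 2·abs(x)−1 example (the grid size and the range of D are arbitrary): compose the map D times and count the linear pieces.

    import numpy as np

    # L(x) = 2*abs(x) - 1, with abs(x) written via a ReLU as x + 2*relu(-x),
    # composed D times on [-1, 1].
    relu = lambda z: np.maximum(z, 0.0)
    L = lambda x: 2 * (x + 2 * relu(-x)) - 1

    x = np.linspace(-1.0, 1.0, 2**17 + 1)    # dyadic grid, so the arithmetic stays exact
    for D in range(1, 9):
        y = x.copy()
        for _ in range(D):
            y = L(y)
        # The composite is piecewise linear and every breakpoint is a local
        # extremum here, so #linear segments = #interior extrema + 1.
        dy = np.diff(y)
        extrema = np.count_nonzero(dy[:-1] * dy[1:] < 0)
        print(D, extrema + 1)                # prints D, 2**D

The segment count doubles with every extra layer, which is exactly the exponential growth in p described above.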
I meant 'trained' networks in contrast to totally-synthetic intentionally-pathological networks like my 2abs-1 example. (It could be that only pathological cases give bad results - similar to how (Hindley-Milner) type inference is EXPTIME-complete but works fine for ~all real world programs - but that seems unlikely.)
The best use case would be if you can efficiently switch between representations. Shallow nets don't suffer from the vanishing gradient problem, so you could have very, very deep RNNs, convert to a shallow net for the gradient update, and then convert back to an RNN for the next rollout. Of course, most RNNs track an internal state, which isn't considered in the paper, but overcoming that limitation could make RNN training effective for long enough rollouts that transformers are no longer needed, especially given that RNNs scale better for long sequences than transformers do.
If you can go back and forth between representations with better than quadratic scaling, it would mean RNNs with infinite context lengths and no vanishing gradient. You wouldn't need transformers.
I think you're being a little too quick to jump to the answer. There are plenty of functions where this isn't true. Two easy examples are functions with jump discontinuities (something that looks like ____----- but more wiggly, i.e. piecewise continuous) and functions with holes. These cannot be approximated to arbitrary accuracy, even with an infinite number of them.
You need to pay close attention to the wording in the approximation theorems.
Hornik[0], which gives the proof you're discussing, says
> This paper rigorously establishes that standard multilayer feedforward networks with as few as one _hidden layer_ using arbitrary squashing functions are capable of approximating any __Borel measurable function__ from one __finite__ dimensional space to another to any desired degree of accuracy
The confusion is probably in the Borel sigma-algebra. Borel means that it includes all finite open intervals in the real numbers. So (0,1) but not [0,1]. Open means boundary is not included! The interval also needs to be continuous, so our discontinuities violate the assumptions. Funahashi's[1] and Cybenko's[2] also require continuous functions.
This is actually a really important thing that gets ignored because it "seems obvious." Or maybe we just see it too often. But it is __critical__ to pay attention to assumptions and limitations. There is a lot of that going off the rails lately with ML, and it's going to give us some roadblocks (we're already seeing some[side note]). EVERYTHING (and I mean literally everything) has assumptions, and therefore biases[3]. Every evaluation method you use has a bias. Every learning method you use has a bias. Every dataset. Every architecture. Everything. This is because everything has a certain number of assumptions baked in. We try to reduce these and make them as sane as possible, but we should be aware of them. And in the case of the universal approximation theorems, well, the limitation is fairly meaningful: data is not guaranteed to lie on a smooth continuous manifold.
[side note] We actually see this a lot in evaluation, and this is why benchmarkism is so problematic: it causes us to look at results as hard signals instead of guides. Evaluation is fucking hard. No empirical results will ever give you the full answer: the map is not the territory. We saw a Hugging Face blog today[4] about the LLM results differing, and the reason is that the different evaluation methods biased towards different models. This means the metrics can be hacked, even specifically by the RLHF tuning. Or even the tokenization. These things are hard to make really good judgements on if you don't know the limits of the evaluation method (i.e. the assumptions it makes).
> Borel means that it includes all finite open intervals in the real numbers. So (0,1) but not [0,1].
The Borel algebra is generated by open sets, but it includes complements (and therefore closed sets) as well. In fact it's also generated by closed sets. The Borel algebra of R also contains sets like the rationals and the irrationals. The types of sets that aren't included in the Borel algebra (but are in the Lebesgue) are nasty things like (some) subsets of the Cantor set.
Could you please stop posting unsubstantive comments and flamebait? You've unfortunately been doing it repeatedly. It's not what this site is for, and destroys what it is for.
If you know more than others, that's great—please share some of what you know, so the rest of us can learn something, or else don't post. Sneers and putdowns only make everything worse.
The benefits here are not really to do with implementing inference, but rather the improvement in explainability that a shallow network can provide. Skip to section 5 (page 8) for that.
If anything, inference will probably get worse. Typically in ML, constraining your architecture (which is what you're doing when you make a deep narrow network out of a shallow wide network, or RNNs or GRUs/LSTMs, or transformers out of FCNNs) yields inference and training performance benefits.
So any N-layer network with the relu _/ activation function can be restated in terms of a 3-layer network, as long as the 3 layers can have infinite parameters and as long as you have near-infinite computing power to convert the big ones. Cool, but not yet useful.
> So any N-layer network with the relu _/ activation function
The paper talks only about models with additive operations among activations. It doesn't say anything about more complex networks like transformers.
In transformers there are multiplicative interactions between activations inside the attention matrix; it is unclear if they can be approximated with just a 3-layer ReLU network, or if such a conversion would be practical at all.
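For concreteness, a bare-bones single-head attention step in numpy (shapes and names are illustrative), showing where activations multiply each other rather than just passing through an affine map and a ReLU:

    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 5, 8                          # sequence length, model width (illustrative)
    X = rng.normal(size=(T, d))          # token activations
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)        # activation-times-activation interaction
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    out = weights @ V                    # again multiplicative in the activations
    print(out.shape)                     # (5, 8)

Both Q @ K.T and weights @ V are data-dependent products of activations, which is the part a purely additive ReLU construction doesn't obviously cover.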
> How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its “width” — namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers — is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK) which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. An attraction of such ideas is that a pure kernel-based method is used to capture the power of a fully-trained deep net of infinite width.
> It has long been known that a single-layer fully-connected neural network with an i.i.d. prior over its parameters is equivalent to a Gaussian process (GP), in the limit of infinite network width.
And of course, one needs to look back at SVMs applying a kernel function and separating with a line, which looks a lot like an ANN with a single hidden layer followed by a linear mapping.
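A small numpy sketch of that analogy (sizes, the ridge penalty, and the toy target are all illustrative): a frozen random ReLU hidden layer plus a trained linear readout, i.e. a random-features stand-in for a kernel method.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, width = 200, 2, 2000
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)   # toy regression target

    W = rng.normal(size=(d, width))                  # frozen random first layer
    b = rng.uniform(-1, 1, size=width)
    H = np.maximum(X @ W + b, 0.0)                   # ReLU random features

    lam = 1e-2                                       # ridge penalty
    beta = np.linalg.solve(H.T @ H + lam * np.eye(width), H.T @ y)  # linear readout
    print(np.mean((H @ beta - y) ** 2))              # training MSE of the shallow fit

Only the linear readout is trained here; the kernel perspective (NTK, GP limits) is roughly what you get as the random hidden layer is made infinitely wide.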
Nice! I have to admit I didn't follow all the math, but the layers get really wide, right? You still need to describe all the partitions between the pieces of the piecewise linear functions. Does this save any computation at inference time?
My hunch is that you can fairly easily find (adversarial) counterexamples showing that in the worst case you will always use more compute by shallow-ing. In my head it should be related to the no free lunch theorem.
I'm not an expert, but I wonder if the wide network could then be trimmed down for special purposes? E.g., find the parts of the network needed for a certain task by looking at activations on inputs and dump the rest.
Besides pruning (which is what you describe) you might find the "lottery ticket hypothesis" interesting. The idea is essentially that in a randomly initialized large network there is a subset of weights that are already closeish for a given task and training helps tune that and suppress everything else, and that knowing this you can actually get better results by pruning before training.
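For illustration, a toy magnitude-pruning step in numpy (the layer size and keep fraction are arbitrary); the lottery-ticket variant would apply a mask like this and then retrain from the original initialization rather than pruning after training.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(512, 512))           # one dense layer's weights

    keep_fraction = 0.1                       # keep the largest 10% by magnitude
    threshold = np.quantile(np.abs(W), 1 - keep_fraction)
    mask = np.abs(W) >= threshold             # roughly, the surviving "ticket"
    W_pruned = W * mask

    print(mask.mean())                        # ~0.1 of the weights remain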
I don't think you need induction, you can just use the fact that the input is a fixed size and the output is a fixed size. The "constructive" proof is, feed in all possible inputs, record the outputs in a lookup table.
Which obviously only works for the training data. It's a good example to remind us that the whole point is to predict unseen input-output pairs (generalization), so what is important is not so much the ability to fit a function, but to interpolate and extrapolate that function. And different bases and different fitting algorithms will have different behaviour in that respect.
Ah, I get your point now; I misunderstood. I thought you were talking about comparing the neural network to a lookup table, not modeling the network itself. In that case the proof is only true if you consider a digital implementation of the neural network. Since it's a continuous function, the proof would be impossible mathematically, as the domain is not enumerable. But if you consider only every possible float32, for example, then it works.
In any case it's kind of a useless statement that way as it says nothing about neural networks. You can replace "neural network" with "function" and it still works.
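That said, the enumerate-everything idea is actually doable for a tiny input space; here's a sketch for a single float16 input (tanh stands in for "the network", and this obviously blows up combinatorially for wider inputs):

    import numpy as np

    bits = np.arange(2**16, dtype=np.uint16)
    xs = bits.view(np.float16)                 # every possible float16 value
    table = np.tanh(xs.astype(np.float32))     # precompute the function everywhere

    def lookup(x):
        return table[np.float16(x).view(np.uint16)]

    print(lookup(0.5), np.tanh(0.5))           # agree up to rounding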
You're right, but the idea of looking things up instead of computing them can be useful when we are constrained by the available compute power. I'm not talking about simple lookup tables, of course, but if you look at recent trends in large foundational models, there's a lot of interest in efficient access to external information, or ways to pay attention to the inputs selectively, rather than in all-to-all fashion (e.g. landmark attention tokens).
Oh yeah I didn't see the current discussion as related to that but I find the topic of fact databases for LLMs pretty interesting, thanks for drawing the analogy.
Another maybe silly thought, but if any ReLU network can be written in three layers, does this allow for hyper-efficient ReLU-only hardware for ML where the pipeline is fixed?
Actually I know because I also tried posting it but wasn't allowed because it was a dupe (4 or 5 hours old) - instead I just got taken to that post and was automatically deemed to have voted for it.