I thought it was known for a long time that any function can be represented with a neural network with a single hidden layer. It's an almost trivial finding if you think about it: imagine you have a steep step function that looks something like __/^^, with a non-zero derivative only in a small range, e.g. 0.000-0.001 (or ϵ if you like). Let's call this f. You can piece together any function from these tiny pieces as ∑ᵢ cᵢf(aᵢx+bᵢ)+dᵢ; you just need an infinite number of them. This sum is a neural network with a single hidden layer.
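A minimal numpy sketch of that construction, assuming an illustrative target (sin on [0, 2π]), 200 knots, and a ramp width of 1e-3: one steep ramp unit per knot, weighted by the local increment of the target, summed as a single hidden layer.

    import numpy as np

    # A steep step-like unit (__/^^): rises from 0 to 1 over a window of width eps.
    def f(z, eps=1e-3):
        return np.clip(z / eps, 0.0, 1.0)

    target = np.sin                          # illustrative target function
    knots = np.linspace(0, 2 * np.pi, 200)   # one hidden unit per knot

    c = np.diff(target(knots))               # c_i: jump of the target between knots
    b = knots[:-1]                           # b_i: shift of each unit
    d = target(knots[0])                     # output bias

    def shallow_net(x):
        # sum_i c_i * f(x - b_i) + d, evaluated for all units at once
        return d + f(x[:, None] - b[None, :]) @ c

    x = np.linspace(0, 2 * np.pi, 1000)
    print(np.max(np.abs(shallow_net(x) - target(x))))  # ~0.03; shrinks as knots increase

The max error is on the order of the knot spacing, which is the hand-wavy version of the argument: add more units and it goes to zero (for a continuous target, at least).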
> Based on this proof, we provide an algorithm that, given a deep ReLU network, finds the explicit weights of the corresponding shallow network.
I’m out of date on the research, but I suspect the real value here is the algorithm. Sounds like some version of this could eventually help reduce inference time
I haven't read the paper, but most of the time, in order to achieve matching outputs with fewer layers, you will need exponentially more neurons per layer.
If you have infinite parallelism I believe the shallow network would be faster, but deep networks will use fewer total operations and will be faster in practice.
This is how it is for logic design using gates, and I believe that's not a coincidence.
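A back-of-envelope version of that trade-off, with made-up sizes (depth, widths, and the shallow width W are purely illustrative):

    # Illustrative numbers only: a depth-D, width-w deep net vs. a depth-3 net of
    # width W computing a comparable function, with n inputs and m outputs.
    D, w, n, m = 20, 256, 256, 256
    W = 2**18                          # assumed (much larger) shallow hidden width

    deep_macs = D * w * w              # rough total multiply-accumulates
    shallow_macs = n * W + W * m       # one very wide hidden layer, in and out

    print(deep_macs, shallow_macs)     # deep does ~100x less total work here...
    print(D, 3)                        # ...but needs ~7x more sequential steps

So with enough parallel hardware the depth-3 net finishes in fewer sequential steps, while the deep net wins on total operations (and hence on realistic hardware).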
Finding the right optimum between parallel and deep for logic design is computationally intensive itself, hence it often comes down to experiential learning. The latter implies potential use of machine learning for that optimization. And it continues ... lived happily thereafter. :-)
There may be some utility to optimizing a given neural network to exploit the features of a given computational architecture. The point would be to expand the width of each layer to hit the limits of your hardware's parallelism. I.e., partition the network into n parts and shallow-ize each one to produce a new neural network with n layers, each of which is roughly as wide as your p parallel computational channels.
> Corollary 1. Given a ReLU network N : R^n → R^m, the shallow network S : R^n → R^m of depth L = 3, as specified in the proof of Theorem 1, has width bounded by max{2n + k, 2n + p, 2 · p · m}
I think p might be the number of layers, but I'm not sure, and I'm not sure about k at all.
So they might be claiming a pretty tight bound on the weights.
They (suspiciously, but not necessarily intentionally) avoid calling attention to it, but:
> Let ({fω}ω∈Ω,Ω) be the decomposition that corresponds to N. Label the parts 1,...,p.
p is the number of regions the deep network divides R^n into, and is in general exponential in the network depth. (E.g. consider R^1 = [-1,+1], L[k](x) = 2*abs(x)-1 ∀k (where abs(x) = x + 2*ReLU(-x), IIRC), which divides R^1 into 2^D equal segments using only D individual ReLUs, assuming I wrote the formulas down correctly.)
So for an arbitrary deep network the shallow network size is definitely exponential in the network depth, and while it's not clear whether that's true for typical 'trained' deep networks, I would be surprised if it wasn't.
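A quick numerical check of the 2·abs(x)−1 example (the grid size and the range of D are arbitrary): compose the map D times and count the linear pieces.

    import numpy as np

    # L(x) = 2*abs(x) - 1, with abs(x) written via a ReLU as x + 2*relu(-x),
    # composed D times on [-1, 1].
    relu = lambda z: np.maximum(z, 0.0)
    L = lambda x: 2 * (x + 2 * relu(-x)) - 1

    x = np.linspace(-1.0, 1.0, 2**17 + 1)    # dyadic grid, so the arithmetic stays exact
    for D in range(1, 9):
        y = x.copy()
        for _ in range(D):
            y = L(y)
        # The composite is piecewise linear and every breakpoint is a local
        # extremum here, so #linear segments = #interior extrema + 1.
        dy = np.diff(y)
        extrema = np.count_nonzero(dy[:-1] * dy[1:] < 0)
        print(D, extrema + 1)                # prints D, 2**D

The segment count doubles with every extra layer, which is exactly the exponential growth in p described above.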
I meant 'trained' networks in contrast to totally-synthetic intentionally-pathological networks like my 2abs-1 example. (It could be that only pathological cases give bad results - similar to how (Hindley-Milner) type inference is EXPTIME-complete but works fine for ~all real world programs - but that seems unlikely.)
The best use case would be if you can efficiently switch between representations. Shallow nets don't suffer from the vanishing gradient problem, so you could have very, very deep RNNs, convert to a shallow net for the gradient update, and then convert back to an RNN for the next rollout. Of course, most RNNs track an internal state, which isn't considered in the paper, but overcoming that limitation could make RNN training effective for long enough rollouts that transformers are no longer needed, especially given that RNNs scale better for long sequences than transformers do.
If you can go back and forth between representations with better than quadratic scaling, it would mean RNNs with infinite context lengths and no vanishing gradient. You wouldn't need transformers.
I think you're being a little too quick to jump to the answer. There are plenty of functions where this isn't true. Two easy examples are functions with jump discontinuities (something that looks like ____----- but more wiggly, i.e. piecewise continuous) and functions with holes. These cannot be approximated to arbitrary accuracy, even with an infinite number of them.
You need to pay close attention to the wording in the approximation theorems.
Hornik[0], which gives the proof you're discussing, says
> This paper rigorously establishes that standard multilayer feedforward networks with as few as one _hidden layer_ using arbitrary squashing functions are capable of approximating any __Borel measurable function__ from one __finite__ dimensional space to another to any desired degree of accuracy
The confusion is probably in the Borel sigma-algebra. Borel means that it includes all finite open intervals in the real numbers. So (0,1) but not [0,1]. Open means boundary is not included! The interval also needs to be continuous, so our discontinuities violate the assumptions. Funahashi's[1] and Cybenko's[2] also require continuous functions.
This is actually a really important thing that gets ignored because it "seems obvious." Or maybe we just see it too often. But it is __critical__ to pay attention to assumptions and limitations. There is a lot of that going off the rails lately with ML, and it's going to give us some roadblocks (we're already seeing some[side note]). EVERYTHING (and I mean literally everything) has assumptions, and therefore biases[3]. Every evaluation method you use has a bias. Every learning method you use has a bias. Every dataset. Every architecture. Everything. This is because everything has a certain number of assumptions baked in. We try to reduce these and make them as sane as possible, but we should be aware of them. And in the case of the universal approximation theorems, well, the limitation is fairly meaningful: data is not guaranteed to lie on a smooth continuous manifold.
[side note] We actually see this a lot in evaluation, and this is why benchmarkism is so problematic: it causes us to look at results as hard signals instead of guides. Evaluation is fucking hard. No empirical results will ever give you the full answer: the map is not the territory. We saw a Hugging Face blog today[4] about the LLM results differing, and the reason is that the different evaluation methods biased towards different models. This means the metrics can be hacked, even specifically by the RLHF tuning. Or even the tokenization. These things are hard to make really good judgements on if you don't know the limits of the evaluation method (i.e. the assumptions it makes).
> Borel means that it includes all finite open intervals in the real numbers. So (0,1) but not [0,1].
The Borel algebra is generated by open sets, but it includes complements (and therefore closed sets) as well. In fact it's also generated by closed sets. The Borel algebra of R also contains sets like the rationals and the irrationals. The types of sets that aren't included in the Borel algebra (but are in the Lebesgue) are nasty things like (some) subsets of the Cantor set.
Could you please stop posting unsubstantive comments and flamebait? You've unfortunately been doing it repeatedly. It's not what this site is for, and destroys what it is for.
If you know more than others, that's great—please share some of what you know, so the rest of us can learn something, or else don't post. Sneers and putdowns only make everything worse.
The benefits here are not really to do with implementing inference, but rather the improvement in explainability that a shallow network can provide. Skip to section 5 (page 8) for that.
If anything, inference will probably get worse. Typically in ML, constraining your architecture (which is what you're doing when you make a deep narrow network out of a shallow wide network, or RNNs or GRUs/LSTMs, or transformers out of FCNNs) yields inference and training performance benefits.
So any N-layer network with the relu _/ activation function can be restated in terms of a 3-layer network, as long as the 3 layers can have infinite parameters and as long as you have near-infinite computing power to convert the big ones. Cool, but not yet useful.
> So any N-layer network with the relu _/ activation function
The paper talks only about models with additive operations among activations. It doesn't say anything about more complex networks like transformers.
In transformers there are multiplicative interactions between activations inside the attention matrix; it is unclear if they can be approximated with just a 3-layer ReLU network, or if such a conversion would be practical at all.
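For concreteness, a bare-bones single-head attention step in numpy (shapes and names are illustrative), showing where activations multiply each other rather than just passing through an affine map and a ReLU:

    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 5, 8                          # sequence length, model width (illustrative)
    X = rng.normal(size=(T, d))          # token activations
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)        # activation-times-activation interaction
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    out = weights @ V                    # again multiplicative in the activations
    print(out.shape)                     # (5, 8)

Both Q @ K.T and weights @ V are data-dependent products of activations, which is the part a purely additive ReLU construction doesn't obviously cover.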
> How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its “width” — namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers — is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK) which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. An attraction of such ideas is that a pure kernel-based method is used to capture the power of a fully-trained deep net of infinite width.
> It has long been known that a single-layer fully-connected neural network with an i.i.d. prior over its parameters is equivalent to a Gaussian process (GP), in the limit of infinite network width.
And of course, one needs to look back at SVMs applying a kernel function and separating with a line, which looks a lot like an ANN with a single hidden layer followed by a linear mapping.
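A small numpy sketch of that analogy (sizes, the ridge penalty, and the toy target are all illustrative): a frozen random ReLU hidden layer plus a trained linear readout, i.e. a random-features stand-in for a kernel method.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, width = 200, 2, 2000
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)   # toy regression target

    W = rng.normal(size=(d, width))                  # frozen random first layer
    b = rng.uniform(-1, 1, size=width)
    H = np.maximum(X @ W + b, 0.0)                   # ReLU random features

    lam = 1e-2                                       # ridge penalty
    beta = np.linalg.solve(H.T @ H + lam * np.eye(width), H.T @ y)  # linear readout
    print(np.mean((H @ beta - y) ** 2))              # training MSE of the shallow fit

Only the linear readout is trained here; the kernel perspective (NTK, GP limits) is roughly what you get as the random hidden layer is made infinitely wide.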
Nice! I have to admit I didn't follow all the math, but the layers get really wide, right? You still need to describe all the partitions between the pieces of the piecewise linear functions. Does this save any computation at inference time?
My hunch is that you can fairly easily find (adversarial) counterexamples showing that in the worst case you will always use more compute by shallow-ing. In my head it should be related to the no free lunch theorem.
I'm not an expert, but I wonder if the wide network could then be trimmed down for special purposes? E.g., find the parts of the network needed for a certain task by looking at activations on inputs and dump the rest.
Besides pruning (which is what you describe) you might find the "lottery ticket hypothesis" interesting. The idea is essentially that in a randomly initialized large network there is a subset of weights that are already closeish for a given task and training helps tune that and suppress everything else, and that knowing this you can actually get better results by pruning before training.
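For illustration, a toy magnitude-pruning step in numpy (the layer size and keep fraction are arbitrary); the lottery-ticket variant would apply a mask like this and then retrain from the original initialization rather than pruning after training.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(512, 512))           # one dense layer's weights

    keep_fraction = 0.1                       # keep the largest 10% by magnitude
    threshold = np.quantile(np.abs(W), 1 - keep_fraction)
    mask = np.abs(W) >= threshold             # roughly, the surviving "ticket"
    W_pruned = W * mask

    print(mask.mean())                        # ~0.1 of the weights remain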
I don't think you need induction, you can just use the fact that the input is a fixed size and the output is a fixed size. The "constructive" proof is, feed in all possible inputs, record the outputs in a lookup table.
Which obviously only works for the training data. It's a good example to remind us that the whole point is to predict unseen input-output pairs (generalization), so what is important is not so much the ability to fit a function, but to interpolate and extrapolate that function. And different bases and different fitting algorithms will have different behaviour in that respect.
Ah, I get your point now; I misunderstood. I thought you were talking about comparing the neural network to a lookup table, not modeling the network itself. In that case the proof is only true if you consider a digital implementation of the neural network. Since it's a continuous function, the proof would be impossible mathematically, as the domain is not enumerable. But if you consider only every possible float32, for example, then it works.
In any case it's kind of a useless statement that way as it says nothing about neural networks. You can replace "neural network" with "function" and it still works.
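That said, the enumerate-everything idea is actually doable for a tiny input space; here's a sketch for a single float16 input (tanh stands in for "the network", and this obviously blows up combinatorially for wider inputs):

    import numpy as np

    bits = np.arange(2**16, dtype=np.uint16)
    xs = bits.view(np.float16)                 # every possible float16 value
    table = np.tanh(xs.astype(np.float32))     # precompute the function everywhere

    def lookup(x):
        return table[np.float16(x).view(np.uint16)]

    print(lookup(0.5), np.tanh(0.5))           # agree up to rounding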
You're right, but the idea of looking things up instead of computing them can be useful when we are constrained by the available compute power. I'm not talking about simple lookup tables, of course, but if you look at recent trends in large foundational models, there's a lot of interest in efficient access to external information, or ways to pay attention to the inputs selectively, rather than in all-to-all fashion (e.g. landmark attention tokens).
Oh yeah I didn't see the current discussion as related to that but I find the topic of fact databases for LLMs pretty interesting, thanks for drawing the analogy.
Another maybe silly thought, but if any ReLU network can be written in three layers, does this allow for hyper-efficient ReLU-only hardware for ML where the pipeline is fixed?
Actually I know because I also tried posting it but wasn't allowed because it was a dupe (4 or 5 hours old) - instead I just got taken to that post and was automatically deemed to have voted for it.