Learning Depth-Three Neural Networks in Polynomial Time (arxiv.org)
80 points by abecedarius on Oct 29, 2017 | 26 comments



I will try to provide some intuition to help the uninitiated get started with understanding the results in this paper.

In supervised machine learning, our goal is to predict well with respect to an underlying "unknown" distribution, given a set of data instances drawn from it. Say I gave you a set of images of cats and dogs to be classified. As a machine learning engineer, you would probably build me a binary classifier on top of an SVM and be done. But how would you justify to me how well your algorithm generalizes? In other words, how would you give me a guarantee that your model will work "well" on images it hasn't seen yet? And what would working "well" even mean? If you recall SVM basics, one tries to build a max-margin hyperplane in some high-dimensional space. We will call the set of all such hyperplanes our "concept class", and, to be technical, this is what you are trying to learn (hold that thought).

When you finally hand me your SVM model with specific parameters, we will call it your hypothesis; the unknown "true" classifier you are trying to approximate is the target concept. Now, coming back to the "well"-ness of the SVM: what guarantees can you provide when I give you images from outside your training set but from the same underlying distribution (to be fair to you)? We call the error you make on your training set the "empirical risk" and the error you make on unseen data outside the training set the "risk". Our aim is to minimize the "risk", or in other words, to "generalize" well.
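To make the empirical-risk-vs.-risk distinction concrete, here is a minimal scikit-learn sketch (the synthetic dataset and the 70/30 split are my own illustrative choices, not anything from the thread or the paper); the training error stands in for the empirical risk, and the held-out error approximates the risk:

    # Minimal sketch: empirical risk = error on the training set,
    # risk = error on unseen data from the same distribution (approximated by a held-out set).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # stand-in for cat/dog features
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    clf = SVC(kernel="rbf").fit(X_train, y_train)        # max-margin classifier
    empirical_risk = 1.0 - clf.score(X_train, y_train)   # error on the data we trained on
    estimated_risk = 1.0 - clf.score(X_test, y_test)     # error on unseen data from the same distribution
    print(empirical_risk, estimated_risk)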

In Learning Theory, we are concerned with defining this "well"-ness of a learning algorithm. More generally, we are interested in answering the question "what is learnable, and how well?". The fundamental answer to that question is PAC (Probably Approximately Correct) learning theory [1].

Paraphrasing in plain English (I'd recommend looking at the formal definition right after this), it says that a concept class is efficiently learnable if and only if there is a polynomial-time algorithm such that, for samples drawn from any possible unknown underlying distribution, we can upper-bound the generalization error/risk with a stated confidence. In addition, the number of samples the algorithm needs must itself be bounded by a polynomial (in the inverse error and inverse confidence parameters).
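For reference, the usual formal statement in the realizable setting (my paraphrase of the standard textbook definition, following Valiant [1]): a concept class $C$ over examples of size $n$ is PAC-learnable if there exist an algorithm $A$ and a polynomial $p$ such that

    for every target concept $c \in C$, every distribution $D$, and every $\epsilon, \delta \in (0, 1/2)$:
    given $m \ge p(1/\epsilon,\ 1/\delta,\ n,\ \mathrm{size}(c))$ i.i.d. examples labeled by $c$,
    $A$ outputs a hypothesis $h$ with $\Pr\big[\mathrm{err}_D(h) \le \epsilon\big] \ge 1 - \delta$,
    where $\mathrm{err}_D(h) = \Pr_{x \sim D}[h(x) \ne c(x)]$, and $A$ runs in time polynomial in $m$.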

Skipping a few other details, PAC theory is powerful because it provides a statistical framework to quantify this "well"-ness in a probabilistic setting. On top of that, it makes absolutely no assumption about the underlying distribution and hence is "distribution-free". This means that whatever learning algorithm you provide, it should work on any distribution whatsoever (with the added technicality that the number of training samples needed is polynomial in the inverse error and confidence parameters).

To get a more intuitive understanding, consider predicting the next term of the sequence "1, 2, 4, 8, 16, _". Your first guess will most likely be "32", because it looks like a geometric progression with ratio 2. But I say that the next number is actually "12345". Am I wrong? No. Were you wrong? No. The takeaway is that true learning is intractable when the underlying distribution is truly "unknown" (as in this case). Instead, I quantify the correctness and tell you that I am 99% confident that the generalization error of my predictions on unseen points will be at most 10%. Note that those numbers are arbitrary, just to get the point across.

Now that we have PAC theory in place: in practice this tool is often hard to use because it is very strict, so instead we rely on complexity measures that can be easier to work with in some scenarios, such as the Rademacher complexity (which is distribution-dependent), the growth function, or the VC dimension.
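As one example of such a bound (a standard VC-type result; the exact constants vary between textbooks, so take this form as indicative): with probability at least $1-\delta$ over a training sample of size $m$, every hypothesis $h$ from a class of VC dimension $d$ satisfies

    $R(h) \;\le\; \hat{R}(h) + \sqrt{\dfrac{d\,(\ln(2m/d) + 1) + \ln(4/\delta)}{m}}$

where $R$ is the risk and $\hat{R}$ the empirical risk defined above.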

Coming back to the paper: neural networks have been a hard nut to crack in terms of such theoretical bounds, because these bounds often turn out to be trivial and useless (e.g., "probability <= 10"). The paper first introduces a number of probabilistic tools, on the basis of which it claims to have found an "efficient" (because polynomial-time) PAC algorithm for learning intersections of polynomially many halfspaces.
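For readers unfamiliar with the term, an "intersection of halfspaces" is just an AND of linear threshold functions. A tiny sketch (the weights and points below are made-up illustrations, not from the paper):

    import numpy as np

    def intersection_of_halfspaces(x, W, b):
        # Label x positively only if it lies on the positive side of every hyperplane,
        # i.e. w_i . x - b_i >= 0 for all i.  W: (k, n) normal vectors, b: (k,) offsets.
        return int(np.all(W @ x - b >= 0))

    # Hypothetical usage: two halfspaces in R^2 whose intersection is the positive quadrant.
    W = np.array([[1.0, 0.0], [0.0, 1.0]])
    b = np.zeros(2)
    print(intersection_of_halfspaces(np.array([0.5, 2.0]), W, b))   # 1
    print(intersection_of_halfspaces(np.array([-0.5, 2.0]), W, b))  # 0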

[1] A Theory of the Learnable (http://web.mit.edu/6.435/www/Valiant84.pdf)


I have applied a few things from that field, but my knowledge may be a bit rusty.

Still, learning a 3-layer NN in polynomial time is a big step forward, because it means we don't get stuck at a bad error (or at least we now know when we are stuck), which in turn suggests a presumably non-random weight initialization. This is a big advantage for using and testing simple NNs.

It might also finally lend a concrete interpretation to such NNs, removing their black-box nature and increasing their adoption.

That said, I haven't read the article.


Question about "We give a polynomial-time algorithm for learning neural networks with one hidden layer of sigmoids feeding into any smooth, monotone activation function (e.g., sigmoid or ReLU)":

ReLU is piecewise linear, which I always thought was non-smooth. Did I get that wrong?


I think the discontinuity in the first derivative is smoothed or interpolated somehow, and that this also happens in existing NN approaches where a gradient must be computed...

I could be wrong though
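For what it's worth, a small sketch of how this is commonly handled in practice (my reading, not necessarily what the paper does): autodiff frameworks simply pick a subgradient at the kink, and softplus is the usual smooth surrogate for ReLU:

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def relu_grad(z):
        # The derivative is undefined at z = 0; frameworks conventionally pick 0 (or 1) there.
        return (z > 0).astype(float)

    def softplus(z):
        # Smooth surrogate log(1 + e^z), computed stably; its derivative is the logistic sigmoid.
        return np.logaddexp(0.0, z)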


In conjunction with the universal approximation theorem, does this mean that this algorithm can learn all p-concepts in polynomial time?


Polynomial in the number of hidden units you need to express the network, which may very well be exponential in the input dimension.


This paper seems somewhat suspect.

For one, "depth-three" implies three layers, and in the standard terminology of the field what they really mean is "depth one".

And another major red flag:

We give a polynomial-time algorithm for learning neural networks with one hidden layer of sigmoids feeding into any smooth, monotone activation function (e.g., sigmoid or ReLU)

How is ReLU smooth?


They cite https://arxiv.org/abs/1610.09887 for their definition of network depth, which defines it in such a way that e.g. a ReLU network of depth 2 is of the form linear2(ReLU(linear1(input))). That is, depth is the number of linear layers.
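Concretely, under that convention (a rough sketch; the layer shapes are arbitrary placeholders):

    import numpy as np

    relu = lambda z: np.maximum(z, 0.0)

    # "Depth 2" under the cited definition: two linear maps with one nonlinearity between them,
    # i.e. linear2(ReLU(linear1(x))).  Depth counts the linear layers, not the activations.
    def depth_two(x, W1, b1, W2, b2):
        return W2 @ relu(W1 @ x + b1) + b2

    # "Depth 3": one more linear layer (and nonlinearity) stacked on top.
    def depth_three(x, W1, b1, W2, b2, W3, b3):
        return W3 @ relu(W2 @ relu(W1 @ x + b1) + b2) + b3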

The "depth-three" model in this paper is a bit strange in that their second layer has only one output, so the third linear layer doesn't have any effect. I would have called this "depth two"; but it is internally consistent with their definition of depth.

> How is ReLU smooth?

It is 1-Lipschitz, which is smooth enough for them.


> The "depth-three" model in this paper is a bit strange in that their second layer has only one output, so the third linear layer doesn't have any effect. I would have called this "depth two"; but it is internally consistent with their definition of depth.

No, it does have an effect: It takes a linear combination of the outputs of the previous layer, and then applies a non-linearity $\sigma'$. If $\sigma'$ is the logistic function, then the output of the last layer is a probability.
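For concreteness, a minimal numpy sketch of the model as this comment reads it (one hidden layer of sigmoids, a single output unit with outer activation $\sigma'$); the precise formulation is in section 5.1 of the paper, so treat the details here as assumptions:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def network(x, A, w, outer=sigmoid):
        # A: (k, n) first-layer weights, w: (k,) weights of the single output unit.
        hidden = sigmoid(A @ x)     # one hidden layer of sigmoids
        return outer(w @ hidden)    # scalar output passed through sigma' (e.g. logistic or ReLU)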


No, the last layer is just a linear layer, there's no non-linearity. That's what's strange about their definition. The depth-three network only applies a non-linearity twice, which would conventionally be labeled as depth two.


You're wrong. See section 5.1. I've drawn a graph of it that shows it to be a classical NN with a single output node.


Yes, it is a classical NN with a single output node. I'm not disputing that, I just think their calculation of depth is strange. The network only applies the sigmoid function twice, and would ordinarily be regarded as having a depth of two. The third linear layer is fixed to multiplying by 1, which is what I meant by "has no effect". (Did you miss that I was talking about the third layer?)


Here's a diagram [1] of a one-hidden-layer NN. It has 2 activation functions. Their NN is of the same type.

[1] - https://raw.githubusercontent.com/qingkaikong/blog/master/40...


The question is, how deep is that network?


That's a loaded sentence with 2 terms I have no idea about: "depth 3" and "polynomial time".


@averagewall already explained what depth 3 is.

The explanation of poly time is a little bit lengthier. Let me briefly explain the PAC learning paradigm in the context of this paper.

Consider the set of all depth-3 neural networks with n-dimensional input and a k-dimensional hidden layer whose weights have total norm 1. Let this set be called H. Note that this is a well-defined set, parameterized by the weights of the neural networks.

Let ∊ be an arbitrary positive error value and let δ be an arbitrary probability value. We are interested in ∊ ≈ 0 and δ ≈ 0. The solution of the PAC learning problem is an algorithm, say A, which takes as input a dataset D generated from a distribution Π, runs in time polynomial in the size of D, and spits out a specific neural network h that lies in H. Furthermore, no matter what Π is, the probability that the neural net h produced by A has error at most ∊ above the optimal error achievable by any member of H should be at least 1-δ. This is the PAC learning paradigm.

The paper assumes that Π is a distribution over the n-dimensional unit sphere (so all examples have the same Euclidean norm) and constrains the weight vectors to have unit norm; with these assumptions, it gives an algorithm that requires |D| = poly(n, k, 1/∊) samples and runs in time poly(|D|).
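Written out in the notation above, the guarantee is roughly (my paraphrase; the paper states the exact dependence on δ and on the norms):

    Pr_{D ~ Π^m} [ err_Π(A(D)) ≤ min_{h ∈ H} err_Π(h) + ∊ ] ≥ 1 − δ,   where m = |D| = poly(n, k, 1/∊).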


The abstract says depth-3 means a NN with 1 hidden layer. So I guess the 3 refers to one input layer, one hidden layer, and one output layer.

Polynomial time roughly means "fast enough that it could be practical at large scale if you have enough computing power". It's in contrast to exponential time which means "no amount of money can ever buy enough computers to do it at scale".

My question is what's the existing learning time for such a network? Maybe it's already just as good but these guys have a proof that it works in general instead of just hoping and finding that it usually seems to be fast enough?


While the idea that "polynomial time is fast and exponential time is slow" is quite popular, it's important to realize that those are descriptions of the growth in computation time for ever larger input sizes.

In practice, you are usually only interested in a limited range of inputs, for which an exponential time algorithm may turn out to be faster. If you really want the fastest algorithm, you need to actually measure the runtime on realistic data.
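A toy illustration of that point (the cost models below are made up): an exponential-time method can be cheaper than a polynomial-time one on small inputs, and the asymptotics only take over past a crossover size:

    # Made-up cost models: poly(n) = 100 * n**3 steps vs. exp(n) = 2**n steps.
    for n in (5, 10, 20, 30, 40, 50):
        poly_cost = 100 * n**3
        exp_cost = 2**n
        winner = "exponential" if exp_cost < poly_cost else "polynomial"
        print(f"n={n:>2}: poly={poly_cost:>12,}  exp={exp_cost:>16,}  cheaper={winner}")
    # The exponential model wins for n = 5 and 10, then loses badly from about n = 20 onward.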


My question is more about how many shallow neural networks are still in use. The "deep" in deep learning is typically much greater than three.


You're absolutely right. But this is still a huge step forward. There has been work on the VC dimension of neural networks for a long time (and it's been shown to be finite), which is a necessary but not sufficient condition for efficient PAC learnability.

If it can be done for 3 layers, then maybe it can be done for more. And I happen to really like it when my problems have polynomial time guarantees.

[VC dimension is a quantitative measure related to the complexity that a model's parameters allow it to express. I think of it as analogous to entropy for physical systems.]


Apart from depth-bragging contests, this might become less relevant if we go by the notion "poly time == solved". The reason is that 3-layer neural networks are uniform approximators of smooth functions.

I have to read the paper carefully for the assumptions made, but if the result holds for the entire class of 3-layer NNs, or a class big enough not to sacrifice uniform approximation, then this would be a Big Deal.

Of course for practical applications, poly-time may not mean that it is a solved problem, or that the poly-time algorithm is the best algorithm to use on a typical instance. The exponent or the leading constant could be very high, and deeper networks may well offer more conducive properties.


I think this assessment of what's "typical" may just be based on posturing. For some reason you can't brag about training a relatively shallow neural network (is efficiency not valued in this field?).

My counter-assessment is that the space of problems you solve just by making your NN deeper is very small. Such problems clearly exist, but I've seen few compelling examples outside of image classification.


That is a classic problem with the Hacker News ranking: nobody gets it, but it's still trending high :)


Holy cow. I don't get it.


See m3kw9’s comment. Responses include explanations for some of the terminology.


Thanks.



