> Their key innovation was the idea that connections that were pruned after the network was trained might never have been necessary at all. To test this hypothesis, they tried training the exact same network again, but without the pruned connections. Importantly, they "reset" each connection to the weight it was assigned at the beginning of training.
> “It was surprising to see that re-setting a well-performing network would often result in something better,” says Carbin.
This, intuitively, makes sense to me. It seems that the pruned model wastes fewer training cycles on inferior weights, and can therefore spend more cycles further optimizing the good weights.
This is neat, but one important point is that networks NEED to be significantly overparameterized for SGD-like methods to find good minima. I know that's slightly tangential.
Another interesting point is that NNs can be thought of as performing coordinate transformations on the data manifold, which means these subnets are potentially approximations to those transformations, possibly up to some scaling factor.
This is an interesting idea that could be combined with the idea suggested in the article.
In the article, if I understood correctly, what they propose is: train your network once, remove the x% of weights with the smallest absolute magnitude, then retrain your network with those weights fixed to 0, starting from the same initial starting point.
The idea behind this is that the first optimization run is telling you that the solution lies near a subspace where those weights are 0, but it can't quite converge to it. So you project onto this subspace by forcing the weights to 0. Then you retrain, and the search is easier because the space is smaller; and because you start from the same starting point, you are more or less guaranteed to reach the same optimum, but projected.
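To make that concrete, here's a minimal sketch of the train / prune / rewind loop on a toy linear model. This is my own illustration, not code from the paper; the 80% pruning rate and all the names are made up.

```python
import numpy as np

# Toy sketch of the prune-and-rewind procedure described above, on a
# linear model so it stays self-contained. Everything here (the 80%
# pruning rate, the toy data) is illustrative, not from the paper.

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))
true_w = rng.normal(size=32) * (rng.random(32) < 0.25)  # sparse ground truth
y = X @ true_w + 0.01 * rng.normal(size=256)

def train(w0, mask, steps=500, lr=0.1):
    """Full-batch gradient descent on squared error; masked weights stay 0."""
    w = w0 * mask
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(X)
        w -= lr * grad
        w *= mask  # re-apply the mask so pruned weights remain exactly 0
    return w

w_init = rng.normal(size=32)             # 0) remember the original init
w_dense = train(w_init, np.ones(32))     # 1) train the full network

# 2) prune the x% of weights with the smallest magnitude (here x = 80)
threshold = np.quantile(np.abs(w_dense), 0.80)
mask = (np.abs(w_dense) >= threshold).astype(float)

# 3) rewind the survivors to their *original* init and retrain the subnet
w_ticket = train(w_init, mask)

print("dense  MSE:", np.mean((X @ w_dense - y) ** 2))
print("ticket MSE:", np.mean((X @ w_ticket - y) ** 2))
```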
The problem with the sparsity in the article is that while some weights are 0, they are 0 through masking, so you are still doing all the computations and don't really benefit from the sparsity. If you have enough zeros, you can benefit by switching to a sparse representation, but per element those are typically an order of magnitude slower than a dense representation.
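For illustration, here's roughly what that trade-off looks like in NumPy/SciPy. This is only my sketch; actual timings depend heavily on density, matrix sizes, and hardware.

```python
import time
import numpy as np
import scipy.sparse as sp

# A masked dense matrix still pays for every multiply; a sparse (CSR)
# representation skips the zeros but adds indexing and indirection
# overhead. Whether CSR wins at a given density is machine-dependent.

rng = np.random.default_rng(0)
W = rng.normal(size=(2048, 2048))
mask = rng.random(W.shape) < 0.10      # keep ~10% of the weights
W_masked = W * mask                    # "sparse" via masking: still dense
W_csr = sp.csr_matrix(W_masked)        # genuinely sparse representation
x = rng.normal(size=2048)

t0 = time.perf_counter(); _ = W_masked @ x; t1 = time.perf_counter()
t2 = time.perf_counter(); _ = W_csr @ x;    t3 = time.perf_counter()
print(f"masked dense: {t1 - t0:.2e}s   CSR sparse: {t3 - t2:.2e}s")
```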
You could combine the idea of the article with OpenAI's idea of block-sparse neural networks, which reduce the number of operations without suffering too much from the non-locality and indirection of a sparse representation.
After training normally (provided you have enough memory), possibly with a block-sparse regularization term to help induce block-sparsity, you could prune in such a way that the least significant blocks are removed. You might then expect both the boost in speed and the better accuracy and convergence properties.
This is a kind of two-phase search: first we look for a finer structure, then we restart to find the best weights for that structure.
Results from above suggest that you can leave 12.5% of your weights non-zero and get a nice 4x speed-up (1/8 of the operations, each running at half the dense speed). Accuracy starts to drop at about 12.5% of weights remaining in both the OP and OpenAI papers.
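As a sketch, block-wise magnitude pruning could look something like this. This is my illustration; the 32x32 block size and the 12.5% kept fraction just echo the numbers above.

```python
import numpy as np

# Score each block by its L1 norm and zero out the weakest blocks, so
# the surviving structure maps onto dense GEMMs over contiguous blocks.
# Block size and the kept fraction are illustrative choices.

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))
B = 32                                           # block size
keep_frac = 0.125                                # fraction of blocks to keep

blocks = W.reshape(512 // B, B, 512 // B, B)     # (block_row, r, block_col, c)
scores = np.abs(blocks).sum(axis=(1, 3))         # L1 norm of each block
keep = scores >= np.quantile(scores, 1 - keep_frac)

block_mask = np.repeat(np.repeat(keep, B, axis=0), B, axis=1)
W_block_sparse = W * block_mask
print("remaining density:", block_mask.mean())   # ~0.125
```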
Based on what you described, it feels like the MIT paper and the OpenAI paper are essentially the same thing. The only difference is the masking/pruning part, which I think is just an engineering detail.
Sorry if I mis-conveyed the ideas; they are quite different.
The OpenAI paper introduces operations that are a fast middle ground between dense and sparse operations. You still have to specify the sparsity structure you want (although a random sparsity structure often works well).
The MIT paper describes one way to choose a sparsity structure and starting point that will work well in the general case.
The OpenAI approach is more amenable to an obvious HW implementation, because the block sparsity maps onto the GEMM operations the hardware implements in the first place.
There are obviously more sparse solutions available if the block-sparsity constraint is relaxed, so I wouldn't be surprised if the best results come from such a network.
Yes, in the OpenAI paper you don't need to learn the sparsity structure. But AFAIU the MIT paper helps suggest an appropriate sparsity structure.
Concerning the sparsity speed-up, I tried the TensorFlow sparse representation a while back, and it was a high-effort, low-reward process. You had to drop something like 90% of the weights (which reduced accuracy), change your ops (dense, convolutions, ...), and use large enough layers to feel like you were getting anything speed-wise.
The OpenAI block-sparse kernels seem promising; I'll give them a try.
I have thought for a while about "brain surgery" for deep neural nets, much like a permanent kind of dropout.
The idea is that you black out a set of neurons/filters and then train for a short while to overcome the performance penalty. To find the "set" of blacked-out cells you could use a genetic algorithm or something, gradually increasing the number of masks.
The last step would be rearranging the network so that the surviving cells are contiguous, forming smaller layers.
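That rearranging step is mechanical once you know which neurons survive. Here's a minimal sketch for one hidden layer, with illustrative names and a random stand-in for whatever the genetic algorithm selected.

```python
import numpy as np

# Once whole neurons are blacked out, drop their rows/columns so the
# layer is genuinely smaller instead of masked. The `keep` vector below
# is a random stand-in for the GA's selection.

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(128, 64)), rng.normal(size=128)  # 64 -> 128 hidden
W2 = rng.normal(size=(10, 128))                            # 128 hidden -> 10

keep = rng.random(128) > 0.5          # surviving hidden units

# Removing hidden unit i deletes row i of W1/b1 and column i of W2;
# the shrunken network computes exactly what the masked one did.
W1s, b1s, W2s = W1[keep], b1[keep], W2[:, keep]

x = rng.normal(size=64)
h_masked = np.maximum(W1 @ x + b1, 0) * keep     # masked forward pass
h_small = np.maximum(W1s @ x + b1s, 0)           # compact forward pass
assert np.allclose(W2 @ h_masked, W2s @ h_small)
```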
And I remember Hinton hinting at replacing multiple layers with one layer (or none), or big layers with smaller ones, through retraining.
Well, there are long standing network pruning techniques like Optimal Brain Damage (1990) [1] and Optimal Brain Surgeon [2]. Could you be more concrete about what you would be trying to do?
We should train another neural network to produce the initial values for the target network. Training data would be the initial values of the subnetwork along with its loss. Then we have the network generate initial values that are more likely to lead to useful subnetworks.
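A toy version of that pipeline might look like this. Note that everything here is a stand-in: linear models instead of real networks, and a loss-predicting scorer plus sampling instead of a true generator network.

```python
import numpy as np

# Rough sketch of the proposal: collect (initial values, final loss)
# pairs from many runs, fit a "meta" model that scores inits, and use
# it to pick promising ones. All of this is a toy stand-in.

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 8))
y = X @ rng.normal(size=8)

def final_loss(w0, steps=100, lr=0.05):
    """Train a linear model from init w0; return its final loss."""
    w = w0.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(X)
    return np.mean((X @ w - y) ** 2)

# 1) training data for the meta-model: inits and the losses they lead to
inits = rng.normal(size=(200, 8))
losses = np.array([final_loss(w0) for w0 in inits])

# 2) "meta" model: predict final loss from the init (linear here; the
#    proposal would use a neural network)
A = np.hstack([inits, np.ones((200, 1))])
theta, *_ = np.linalg.lstsq(A, losses, rcond=None)

# 3) generate: sample candidate inits, keep the one scored best
cands = rng.normal(size=(500, 8))
pred = np.hstack([cands, np.ones((500, 1))]) @ theta
best = cands[np.argmin(pred)]
print("chosen init -> actual loss:", final_loss(best))
```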
This is essentially the idea behind “meta-learning,” or “learning to learn,” with the slight difference that most of the meta-learning literature aims to initialize networks that can learn quickly (“few-shot learning”). It has pretty good theoretical grounding but in practice seems to be quite expensive.
Lol. I've seen idea after idea dismissed and misunderstood on Internet forums, from Hacker News to reddit.com/r/MachineLearning. Time after time I see ideas, my own and those of others, get dismissed and then later spawn papers by researchers.
Such a toxic atmosphere, I don't know why I bother posting.
I am well aware of what meta-learning is. If you think that because the idea above comes under the umbrella of meta-learning it isn't an interesting idea, then I don't know what to say to you... and the fact that it's downvoted just goes to show the lack of creativity and intuition exhibited by your average Hacker News member.
I never implied that it wasn’t an interesting idea, and your weird defensiveness suggests to me that there might be a common thread to your getting repeatedly dismissed/downvoted other than misunderstood genius.
I never understood the obsessive emphasis on the training of neural networks. Many people see neural networks as objects that you train. But this is just a small thing that you can do with them. Neural networks are, first and foremost, objects that compute. This computation can be tuned by setting some parameters, and one way to set these parameters is by training. But this is not the only way, and it is not necessarily the most interesting one.
Well, that's a bit like asking "what's with the obsessive emphasis on programming computers, after all they're just objects that compute?"
Yes, neural networks are objects that compute; there's even this "universal approximator" theorem that says a basic, albeit sufficiently large, neural network can approximate any arbitrary function (from a broad class of functions) to arbitrary precision. However, the theorem says nothing about whether you'll ever actually _find_ the neural network that corresponds to that function. This is what training is for, it allows us to find (the parameters of) the NN that we want to do some computation.
In other words, training is how we program NNs, but in general it can be really hard to arrive at the "program" you're looking for.
Your point is that training is computationally intensive, but your question is "why do people obsess over training"? Sounds like you've answered your own question, then. It's currently hard to train networks, so people "obsess" over improving the methods (see, for example, the article we're commenting on) so that it doesn't have to take as long.
But also note that it's not necessarily true that NNs spend most of their time training. Maybe you've got to spend a week on a huge GPU cluster training some autonomous-driving algorithm, but then it runs in "compute" mode for hours a day in tens of thousands of cars.
The one thing I do not like about neural networks is that they completely abstract the programmer from the problem they are solving. After the neural network is extensively trained, it's very difficult for even the creator to look at the resulting data structure and make sense of how the problem is actually being solved. I think this is one major reason why neural networks' successes have been so confined to specific problems; it's nearly impossible to predictably modify and improve on a fully functioning, trained multi-hidden-layer network without starting from scratch and completely retraining it.
Good practitioners do both. Whatever knowledge about the weights you can predetermine, you set as the initial values of your weights. Then you improve on that by training. How many weights you'll have a good estimate for really depends on the problem and the size of your network, but yes, you should try to estimate good weights yourself first.
This practice is well known, but here's a concrete source from Andrej Karpathy:
"init well. Initialize the final layer weights correctly. E.g. if you are regressing some values that have a mean of 50 then initialize the final bias to 50. If you have an imbalanced dataset of a ratio 1:10 of positives:negatives, set the bias on your logits such that your network predicts probability of 0.1 at initialization. Setting these correctly will speed up convergence and eliminate “hockey stick” loss curves where in the first few iteration your network is basically just learning the bias."
Good luck manually writing a program that achieves state-of-the-art performance on any typical machine learning task... we'll see which "training" method takes longer.