> Their key innovation was the idea that connections that were pruned after the network was trained might never have been necessary at all. To test this hypothesis, they tried training the exact same network again, but without the pruned connections. Importantly, they "reset" each connection to the weight it was assigned at the beginning of training.
> “It was surprising to see that re-setting a well-performing network would often result in something better,” says Carbin.
This, intuitively, makes sense to me. It seems that the pruned model wastes fewer training cycles on inferior weights, and can therefore spend more cycles further optimizing the good weights.
This is neat, but one important point is that networks NEED to be significantly overparameterized for SGD-like methods to find good minima. I know that's slightly tangential.
Another interesting point is that NNs can be thought of as performing coordinate transformations on the data manifold, which means these subnets are potentially approximations to those transformations, possibly up to some scaling factor.
This is an interesting idea that could be combined with the idea suggested in the article.
In the article, if I understood correctly, what they propose is: train your network once, remove the x% of weights with the smallest absolute magnitude, then retrain your network with those weights fixed to 0, starting from the same initial starting point.
The idea behind this is that the first optimization run is telling you that the solution lies near a subspace where those weights are 0, but it can't quite converge to it. So you project onto this subspace by forcing the weights to 0. Then you retrain, and the search is easier because the space is smaller; and because you start from the same starting point, you are more or less guaranteed to reach the same optimum, but projected.
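To make that concrete, here's a minimal sketch of the train / prune / rewind loop on a toy linear model. This is my own illustration, not code from the paper; the 80% pruning rate and all the names are made up.

```python
import numpy as np

# Toy sketch of the prune-and-rewind procedure described above, on a
# linear model so it stays self-contained. Everything here (the 80%
# pruning rate, the toy data) is illustrative, not from the paper.

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))
true_w = rng.normal(size=32) * (rng.random(32) < 0.25)  # sparse ground truth
y = X @ true_w + 0.01 * rng.normal(size=256)

def train(w0, mask, steps=500, lr=0.1):
    """Full-batch gradient descent on squared error; masked weights stay 0."""
    w = w0 * mask
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(X)
        w -= lr * grad
        w *= mask  # re-apply the mask so pruned weights remain exactly 0
    return w

w_init = rng.normal(size=32)             # 0) remember the original init
w_dense = train(w_init, np.ones(32))     # 1) train the full network

# 2) prune the x% of weights with the smallest magnitude (here x = 80)
threshold = np.quantile(np.abs(w_dense), 0.80)
mask = (np.abs(w_dense) >= threshold).astype(float)

# 3) rewind the survivors to their *original* init and retrain the subnet
w_ticket = train(w_init, mask)

print("dense  MSE:", np.mean((X @ w_dense - y) ** 2))
print("ticket MSE:", np.mean((X @ w_ticket - y) ** 2))
```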
The problem with the sparsity in the article is that while some weights are 0, they are 0 through masking, so you are still doing all the computations and don't really benefit from the sparsity. If you have enough zeros, you can benefit by switching to a sparse representation, but per element those are typically an order of magnitude slower than a dense representation.
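For illustration, here's roughly what that trade-off looks like in NumPy/SciPy. This is only my sketch; actual timings depend heavily on density, matrix sizes, and hardware.

```python
import time
import numpy as np
import scipy.sparse as sp

# A masked dense matrix still pays for every multiply; a sparse (CSR)
# representation skips the zeros but adds indexing and indirection
# overhead. Whether CSR wins at a given density is machine-dependent.

rng = np.random.default_rng(0)
W = rng.normal(size=(2048, 2048))
mask = rng.random(W.shape) < 0.10      # keep ~10% of the weights
W_masked = W * mask                    # "sparse" via masking: still dense
W_csr = sp.csr_matrix(W_masked)        # genuinely sparse representation
x = rng.normal(size=2048)

t0 = time.perf_counter(); _ = W_masked @ x; t1 = time.perf_counter()
t2 = time.perf_counter(); _ = W_csr @ x;    t3 = time.perf_counter()
print(f"masked dense: {t1 - t0:.2e}s   CSR sparse: {t3 - t2:.2e}s")
```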
You could combine the idea of the article with OpenAI's idea of block-sparse neural networks, which reduce the number of operations without suffering too much from the non-locality and indirection of a sparse representation.
After training normally (provided you have enough memory), possibly with a block-sparse regularization term to help induce block-sparsity, you could prune in such a way that the least significant blocks are removed. You might then expect both the boost in speed and the better accuracy and convergence properties.
This is a kind of two-phase search: first we look for a finer structure, then we restart to find the best weights for that structure.
Results from above suggest that you can leave 12.5% of your weights non-zero and get a nice 4x speed-up (1/8 of the operations, each running at half the dense speed). Accuracy starts to drop at about 12.5% of weights remaining in both the OP and OpenAI papers.
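As a sketch, block-wise magnitude pruning could look something like this. This is my illustration; the 32x32 block size and the 12.5% kept fraction just echo the numbers above.

```python
import numpy as np

# Score each block by its L1 norm and zero out the weakest blocks, so
# the surviving structure maps onto dense GEMMs over contiguous blocks.
# Block size and the kept fraction are illustrative choices.

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))
B = 32                                           # block size
keep_frac = 0.125                                # fraction of blocks to keep

blocks = W.reshape(512 // B, B, 512 // B, B)     # (block_row, r, block_col, c)
scores = np.abs(blocks).sum(axis=(1, 3))         # L1 norm of each block
keep = scores >= np.quantile(scores, 1 - keep_frac)

block_mask = np.repeat(np.repeat(keep, B, axis=0), B, axis=1)
W_block_sparse = W * block_mask
print("remaining density:", block_mask.mean())   # ~0.125
```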
Based on what you described, it feels like the MIT paper and the OpenAI paper are essentially the same thing. The only difference is the masking/pruning part, which I think is just an engineering detail.
Sorry if I mis-conveyed the ideas; they are quite different.
The OpenAI paper introduces operations that are a fast middle ground between dense and sparse operations. You still have to specify the sparsity structure you want (although a random sparsity structure often works well).
The MIT paper describes one way to choose a sparsity structure and starting point that will work well in the general case.
The OpenAI approach is more amenable to an obvious HW implementation, because the block sparsity maps onto the GEMM operations the hardware implements in the first place.
There are obviously more sparse solutions available if the block-sparsity constraint is relaxed, so I wouldn't be surprised if the best results come from such a network.
Yes, in the OpenAI paper you don't need to learn the sparsity structure. But AFAIU the MIT paper helps suggest an appropriate sparsity structure.
Concerning the sparsity speed-up, I tried the TensorFlow sparse representation a while back, and it was a high-effort, low-reward process. You had to drop something like 90% of the weights (which reduced accuracy), change your ops (dense, convolutions, ...), and use large enough layers to feel like you were getting anything speed-wise.
The OpenAI block-sparse kernels seem promising; I'll give them a try.
I have thought for a while about "brain surgery" for deep neural nets, much like a permanent kind of dropout.
The idea is that you black out a set of neurons/filters and then train for a short while to overcome the performance penalty. To find the "set" of blacked-out cells you could use a genetic algorithm or something, gradually increasing the number of masks.
The last step would be rearranging the network so that the surviving cells are contiguous, forming smaller layers.
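That rearranging step is mechanical once you know which neurons survive. Here's a minimal sketch for one hidden layer, with illustrative names and a random stand-in for whatever the genetic algorithm selected.

```python
import numpy as np

# Once whole neurons are blacked out, drop their rows/columns so the
# layer is genuinely smaller instead of masked. The `keep` vector below
# is a random stand-in for the GA's selection.

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(128, 64)), rng.normal(size=128)  # 64 -> 128 hidden
W2 = rng.normal(size=(10, 128))                            # 128 hidden -> 10

keep = rng.random(128) > 0.5          # surviving hidden units

# Removing hidden unit i deletes row i of W1/b1 and column i of W2;
# the shrunken network computes exactly what the masked one did.
W1s, b1s, W2s = W1[keep], b1[keep], W2[:, keep]

x = rng.normal(size=64)
h_masked = np.maximum(W1 @ x + b1, 0) * keep     # masked forward pass
h_small = np.maximum(W1s @ x + b1s, 0)           # compact forward pass
assert np.allclose(W2 @ h_masked, W2s @ h_small)
```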
And I remember Hinton hinting at replacing multiple layers with one layer (or none), or big layers with smaller ones, through retraining.
Well, there are long standing network pruning techniques like Optimal Brain Damage (1990) [1] and Optimal Brain Surgeon [2]. Could you be more concrete about what you would be trying to do?
We should train another neural network to produce the initial values for the target network. Training data would be the initial values of the subnetwork along with its loss. Then we have the network generate initial values that are more likely to lead to useful subnetworks.
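A toy version of that pipeline might look like this. Note that everything here is a stand-in: linear models instead of real networks, and a loss-predicting scorer plus sampling instead of a true generator network.

```python
import numpy as np

# Rough sketch of the proposal: collect (initial values, final loss)
# pairs from many runs, fit a "meta" model that scores inits, and use
# it to pick promising ones. All of this is a toy stand-in.

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 8))
y = X @ rng.normal(size=8)

def final_loss(w0, steps=100, lr=0.05):
    """Train a linear model from init w0; return its final loss."""
    w = w0.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(X)
    return np.mean((X @ w - y) ** 2)

# 1) training data for the meta-model: inits and the losses they lead to
inits = rng.normal(size=(200, 8))
losses = np.array([final_loss(w0) for w0 in inits])

# 2) "meta" model: predict final loss from the init (linear here; the
#    proposal would use a neural network)
A = np.hstack([inits, np.ones((200, 1))])
theta, *_ = np.linalg.lstsq(A, losses, rcond=None)

# 3) generate: sample candidate inits, keep the one scored best
cands = rng.normal(size=(500, 8))
pred = np.hstack([cands, np.ones((500, 1))]) @ theta
best = cands[np.argmin(pred)]
print("chosen init -> actual loss:", final_loss(best))
```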
This is essentially the idea behind “meta-learning,” or “learning to learn,” with the slight difference that most of the meta-learning literature aims to initialize networks that can learn quickly (“few-shot learning”). It has pretty good theoretical grounding but in practice seems to be quite expensive.
Lol. I've seen idea after idea dismissed and misunderstood on Internet forums, from Hacker News to reddit.com/r/MachineLearning. Time after time I see ideas, my own and those of others, get dismissed and then later spawn papers by researchers.
Such a toxic atmosphere, I don't know why I bother posting.
I am well aware of what meta-learning is. If you think that because the idea above comes under the umbrella of meta-learning it isn't an interesting idea, then I don't know what to say to you... and the fact that it's downvoted just goes to show the lack of creativity and intuition exhibited by your average Hacker News member.
I never implied that it wasn’t an interesting idea, and your weird defensiveness suggests to me that there might be a common thread to your getting repeatedly dismissed/downvoted other than misunderstood genius.
I never understood the obsessive emphasis on the training of neural networks. Many people see neural networks as objects that you train. But this is just a small thing that you can do with them. Neural networks are, first and foremost, objects that compute. This computation can be tuned by setting some parameters, and one way to set these parameters is by training. But this is not the only way, and it is not necessarily the most interesting one.
Well, that's a bit like asking "what's with the obsessive emphasis on programming computers, after all they're just objects that compute?"
Yes, neural networks are objects that compute; there's even this "universal approximator" theorem that says a basic, albeit sufficiently large, neural network can approximate any arbitrary function (from a broad class of functions) to arbitrary precision. However, the theorem says nothing about whether you'll ever actually _find_ the neural network that corresponds to that function. This is what training is for, it allows us to find (the parameters of) the NN that we want to do some computation.
In other words, training is how we program NNs, but in general it can be really hard to arrive at the "program" you're looking for.
Your point is that training is computationally intensive, but your question is "why do people obsess over training"? Sounds like you've answered your own question, then. It's currently hard to train networks, so people "obsess" over improving the methods (see, for example, the article we're commenting on) so that it doesn't have to take as long.
But also note that it's not necessarily true that NNs spend most of their time training. Maybe you've got to spend a week on a huge GPU cluster training some autonomous-driving algorithm, but then it runs in "compute" mode for hours a day in tens of thousands of cars.
The one thing I do not like about neural networks is that they completely abstract the programmer from the problem they are solving. After the neural network is extensively trained, it's very difficult for even the creator to look at the resulting data structure and make sense of how the problem is actually being solved. I think this is one major reason why neural networks' successes have been so confined to specific problems; it's nearly impossible to predictably modify and improve on a fully functioning, trained multi-hidden-layer network without starting from scratch and completely retraining it.
Good practitioners do both. Whatever knowledge about the weights you can predetermine, you set as the initial values of your weights. Then you improve on that by training. How many weights you'll have a good estimate for really depends on the problem and the size of your network, but yes, you should try to estimate good weights yourself first.
This practice is well known, but here's a concrete source from Andrej Karpathy:
"init well. Initialize the final layer weights correctly. E.g. if you are regressing some values that have a mean of 50 then initialize the final bias to 50. If you have an imbalanced dataset of a ratio 1:10 of positives:negatives, set the bias on your logits such that your network predicts probability of 0.1 at initialization. Setting these correctly will speed up convergence and eliminate “hockey stick” loss curves where in the first few iteration your network is basically just learning the bias."
Good luck manually writing a program that achieves state-of-the-art performance on any typical machine learning task... we'll see which "training" method takes longer.