I don't understand the random-label training part. Presumably you train on randomised labels which have no relationship with the input, but surely it won't generalise well at all, given the small probability of predicting the labels correctly by chance (essentially the setup for a permutation test, am I wrong?).
If you look at Table 1, you see that the models manage to fit the randomized labels almost 100% correctly on the training set, but crucially the corresponding test score is down in the 10% region. This is in stark contrast to the roughly 80-90% test score for the properly labeled data.
So it seems to me that when faced with structured data they manage to generalize from that structure somehow, while when faced with random labels they're powerful enough to simply memorize the training set.
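For concreteness, here's a minimal sketch of that randomized-label setup (PyTorch, CIFAR-10; the architecture and hyperparameters are my own stand-ins, not the paper's):

```python
# Sketch of the randomized-label experiment: replace every training label
# with an independent random class, then train to (near) zero training error.
# Architecture and hyperparameters are illustrative, not from the paper.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

def make_loader(randomize_labels: bool):
    ds = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=T.ToTensor()
    )
    if randomize_labels:
        # Destroy any relationship between inputs and targets.
        g = torch.Generator().manual_seed(0)
        ds.targets = torch.randint(0, 10, (len(ds.targets),), generator=g).tolist()
    return torch.utils.data.DataLoader(ds, batch_size=128, shuffle=True)

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

loader = make_loader(randomize_labels=True)
for epoch in range(100):  # enough epochs for a big-enough model to memorize
    correct = total = 0
    for x, y in loader:
        opt.zero_grad()
        out = model(x)
        loss_fn(out, y).backward()
        opt.step()
        correct += (out.argmax(1) == y).sum().item()
        total += y.numel()
    print(f"epoch {epoch}: train acc {correct / total:.3f}")
# Training accuracy climbs toward ~100% even though the labels are random;
# test accuracy against the true labels stays near the 10% chance level.
```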
edit: just to point out, obviously the test score is expected to be bad for random labels; after all, how could you properly test classification of random data?
So the point, as I understand it, isn't that the randomized labels lead to poor test results, but rather that the network trained on non-randomized labels manages to generalize despite being capable of simply memorizing the training set.
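To put a number on that chance-level point, a quick back-of-the-envelope check (plain NumPy, nothing from the paper): a predictor with no real relationship to the true labels agrees with them about 1/10 of the time on a 10-class problem, which is the ~10% control test score mentioned above.

```python
# Sanity check: random predictions on a 10-class problem agree with the
# true labels at roughly the chance rate of 1 / num_classes.
import numpy as np

rng = np.random.default_rng(0)
num_classes, n = 10, 100_000
true_labels = rng.integers(0, num_classes, size=n)
random_preds = rng.integers(0, num_classes, size=n)  # stand-in for a model fit to shuffled labels
print((random_preds == true_labels).mean())  # ~0.10
```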
AFAIK that's right; it would be very unlikely to generalize on random labels, which is why I read the comment as suggesting the network shouldn't be able to reach low training loss in that situation.