
It's explained in the post:

> Have you ever done a dense grid search over neural network hyperparameters? Like a really dense grid search? It looks like this (!!). Bluish colors correspond to hyperparameters for which training converges, reddish colors to hyperparameters for which training diverges.



I saw this, but am still not clear on what the axes represent. I assume two hyperparameters, or possibly two orthogonal principal components. I guess my point is it’s not clear how/which parameters are mapped onto the image.


Your point is valid, but the paper explains it clearly: they are NOT dimensionally reduced hyperparameters. The hyperparameters are learning rates, that's it. X axis: learning rate for the input layer (the network has 1 hidden layer). Y axis: learning rate for the output layer.

So what this is saying is that, for certain ill-chosen learning rates, model convergence is, for lack of a better word, chaotic and unstable.


Just to add to this: only the two learning rates are changed; everything else, including initialization and data, is fixed. From the paper:

> Training consists of 500 (sometimes 1000) iterations of full batch steepest gradient descent. Training is performed for a 2d grid of η0 and η1 hyperparameter values, with all other hyperparameters held fixed (including network initialization and training data).
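To make the setup concrete, here is a minimal sketch of that experiment in plain NumPy (not the paper's code): a tiny one-hidden-layer tanh network trained with full-batch gradient descent on fixed data and a fixed initialization, where only the two learning rates eta0 (input layer) and eta1 (output layer) are swept over a 2D grid. The network width, data, learning-rate range, and divergence threshold below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Fixed data and initialization (illustrative sizes, not the paper's values).
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))        # fixed training inputs
y = rng.standard_normal((16, 1))        # fixed training targets
W0_init = rng.standard_normal((8, 8))   # fixed input-layer weights
W1_init = rng.standard_normal((8, 1))   # fixed output-layer weights

def train(eta0, eta1, steps=500):
    """Run full-batch gradient descent; return final MSE (inf = diverged)."""
    W0, W1 = W0_init.copy(), W1_init.copy()
    loss = np.inf
    for _ in range(steps):
        h = np.tanh(X @ W0)             # hidden activations
        err = h @ W1 - y
        loss = float(np.mean(err ** 2))
        if not np.isfinite(loss) or loss > 1e6:
            return np.inf               # treat blow-up as divergence
        g1 = h.T @ err                              # grad w.r.t. output-layer weights
        g0 = X.T @ ((err @ W1.T) * (1 - h ** 2))    # grad w.r.t. input-layer weights
        W1 -= eta1 * (2 / len(X)) * g1  # output-layer learning rate
        W0 -= eta0 * (2 / len(X)) * g0  # input-layer learning rate
    return loss

# Dense grid over the two learning rates; everything else stays fixed.
etas = np.logspace(-3, 1, 64)
converged = np.array([[np.isfinite(train(e0, e1)) for e0 in etas]
                      for e1 in etas])
# Plotting `converged` (e.g. with matplotlib's imshow) gives a low-resolution
# version of the blue/red convergence map shown in the post.
```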



