
Just to add to this: only the two learning rates are changed; everything else, including initialization and data, is fixed. From the paper:

Training consists of 500 (sometimes 1000) iterations of full batch steepest gradient descent. Training is performed for a 2d grid of η0 and η1 hyperparameter values, with all other hyperparameters held fixed (including network initialization and training data).
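
For anyone who wants a concrete picture, here's a minimal sketch of that setup: a one-hidden-layer network trained with full-batch gradient descent, where eta0 and eta1 are separate learning rates for the two weight matrices, swept over a 2d grid while the data and the initialization are sampled once and reused for every grid cell. The network size, data, and gradient scaling here are illustrative, not taken from the paper.

    import numpy as np

    # Divergent runs are expected on part of the grid; silence the warnings.
    np.seterr(over="ignore", invalid="ignore")

    def train(eta0, eta1, w0, w1, x, y, iters=500):
        """Full-batch gradient descent; returns final loss (nan if diverged)."""
        w0, w1 = w0.copy(), w1.copy()  # keep the shared init untouched
        loss = np.nan
        for _ in range(iters):
            h = np.tanh(x @ w0)        # hidden activations
            err = h @ w1 - y           # residual of the network output
            loss = float(np.mean(err ** 2))
            if not np.isfinite(loss):
                return np.nan          # training diverged
            # Backprop through the two layers (constant factors dropped).
            g1 = h.T @ err / len(x)
            g0 = x.T @ ((err @ w1.T) * (1 - h ** 2)) / len(x)
            w0 -= eta0 * g0            # each layer gets its own step size
            w1 -= eta1 * g1
        return loss

    rng = np.random.default_rng(0)
    x = rng.normal(size=(16, 8))
    y = rng.normal(size=(16, 1))
    # Sampled once; reused for every (eta0, eta1) cell, as in the quote above.
    w0_init = rng.normal(size=(8, 8)) / np.sqrt(8)
    w1_init = rng.normal(size=(8, 1)) / np.sqrt(8)

    etas = np.logspace(-2, 1, 32)
    grid = np.array([[train(e0, e1, w0_init, w1_init, x, y)
                      for e1 in etas] for e0 in etas])
    converged = np.isfinite(grid)      # the convergence/divergence boundary

Cells where the loss stays finite vs. blows up trace out the boundary the paper then zooms into.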


