robrenaud on Feb 20, 2024, on: Groq runs Mixtral 8x7B-32k with 500 T/s

Training requires a lot more memory than inference: you have to keep the gradients plus the optimizer's gradient statistics, and the optimization needs higher-precision weights. Training is also much more parallelizable. But inference is kind of a subroutine of training, since every training step runs a forward pass.
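To put rough numbers on the memory point, here is a back-of-the-envelope sketch. The byte counts are assumptions, not something stated in the comment: fp16 weights for inference, and a common mixed-precision Adam setup for training (fp16 weights and gradients, fp32 master weights, two fp32 optimizer statistics per parameter). The 7-billion parameter count is a hypothetical example, and activations and the KV cache are ignored.

```python
def inference_bytes(n_params, weight_bytes=2):
    """Memory for the weights alone, assuming fp16 (2 bytes per parameter)."""
    return n_params * weight_bytes


def training_bytes(n_params, weight_bytes=2, grad_bytes=2,
                   master_bytes=4, optimizer_state_bytes=8):
    """Weights + gradients + fp32 master weights + optimizer statistics.

    Assumed layout: mixed-precision Adam, where the optimizer keeps two
    fp32 values (first and second moment) per parameter, i.e. 8 bytes.
    """
    per_param = weight_bytes + grad_bytes + master_bytes + optimizer_state_bytes
    return n_params * per_param


n_params = 7_000_000_000  # hypothetical model size, for illustration only

inf = inference_bytes(n_params)
trn = training_bytes(n_params)

print(f"inference: {inf / 1e9:.0f} GB")
print(f"training:  {trn / 1e9:.0f} GB")
print(f"ratio:     {trn / inf:.0f}x")
```

Under these assumptions training needs about 16 bytes per parameter against 2 for inference, so roughly 8x before counting activations. Other optimizers or precisions would change the ratio.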