danielhanchen on March 31, 2024 | on: Towards 1-bit Machine Learning Models

Oh, very good question - tbh I'm not sure. Another closely related technique is layer offloading: if your network can't fit in GPU memory and has layers 1, 2, ..., 32, we offload layers 16 to 32 to RAM, then load them into GPU memory on the fly. I'm gonna guess the performance hit is similar, although I haven't benchmarked it myself to verify.
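
To make the offloading idea concrete, here's a rough toy simulation of the schedule, not a real implementation. Everything in it is an assumption for illustration: `make_layer`, `forward`, the two dicts standing in for GPU memory and RAM, and the "layer" that just adds its index to the input. No actual device transfers happen; the comments mark where a host-to-GPU copy would go in a real setup.

```python
# Toy simulation of layer offloading. No real GPU is involved:
# `gpu` and `ram` are plain dicts standing in for the two memory pools.

NUM_LAYERS = 32
FIRST_OFFLOADED = 16  # layers 16..32 live in "RAM", as in the comment above


def make_layer(i):
    # Hypothetical stand-in for a real layer: adds i to the activation.
    return lambda x: x + i


# Layers 1..15 stay resident on the "GPU"; layers 16..32 are offloaded.
gpu = {i: make_layer(i) for i in range(1, FIRST_OFFLOADED)}
ram = {i: make_layer(i) for i in range(FIRST_OFFLOADED, NUM_LAYERS + 1)}


def forward(x):
    """Run all layers in order, fetching offloaded ones on the fly.

    Returns the output and the number of simulated RAM -> GPU transfers.
    """
    transfers = 0
    for i in range(1, NUM_LAYERS + 1):
        if i in gpu:
            x = gpu[i](x)
        else:
            staged = ram[i]   # real setup: copy this layer's weights to the GPU
            transfers += 1
            x = staged(x)     # compute with the staged copy
            del staged        # real setup: free the GPU buffer for the next layer
    return x, transfers


out, transfers = forward(0)
print(out, transfers)  # 528 17
```

Each forward pass pays for 17 transfers (layers 16 to 32), which is where the performance hit would come from. How big that hit is in practice depends on transfer bandwidth versus compute time, and this sketch doesn't measure either.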