Usually you want to split each layer to run with tensor parallelism, which works...

		andersa on Aug 27, 2024 \| parent \| context \| favorite \| on: "Tinyboxes finally have a buy it now button" Usually you want to split each layer to run with tensor parallelism, which works optimally if you can assign each kv head to a specific GPU. All currently popular models have a power of 2 number of kv heads.

interesting, thank you for the pointers.