
That is my reading too: if you treat latency as the most important inference metric, then you need all the models resident in memory all the time.

What is your 70B configuration? Did you try TP=8 for the 70B model, for a fair comparison?
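
For context on what TP=8 means here (not the posters' actual setup), a minimal sketch of serving a 70B model with 8-way tensor parallelism in vLLM might look like this; the model name is just an illustrative placeholder:

    # Hypothetical vLLM setup: shard one 70B model across 8 GPUs via tensor parallelism.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model, not necessarily the one discussed
        tensor_parallel_size=8,                     # TP=8: each layer's weights are split across 8 GPUs
    )

    outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)

With TP=8 the per-GPU memory footprint drops roughly eightfold, which is why the comparison against running several smaller models side by side hinges on whether the 70B baseline was given all eight GPUs.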


