* hardware spec
* inference engine
* specific model - tokenizer differences can make models faster or slower even at an equivalent parameter count
* quantization used - and you need to be aware of hardware-specific optimizations for particular quants
* KV cache settings
* input context size
* output token count
This is probably not a complete list either.
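If you're publishing or comparing numbers, one way to keep them honest is to record every one of these dimensions alongside the measurement, so two results are only compared when all but one variable match. A minimal Python sketch of what that record might look like; the field names and example values are hypothetical, not taken from any particular tool:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class BenchmarkRun:
    # Each field is a variable that can change measured tokens/sec,
    # so a fair comparison holds all of them fixed except one.
    hardware: str            # e.g. "RTX 4090" or "M3 Max 64GB"
    engine: str              # e.g. "llama.cpp", "vLLM"
    model: str               # tokenizer differences mean equal parameter
                             # counts can still give different speeds
    quantization: str        # some quants have hardware-specific fast
                             # paths on one GPU/CPU but not another
    kv_cache: str            # e.g. "f16" vs a quantized KV cache
    input_tokens: int        # prompt length processed before generation
    output_tokens: int       # tokens generated during the timed window
    generation_seconds: float

    @property
    def tokens_per_second(self) -> float:
        return self.output_tokens / self.generation_seconds

# Hypothetical run record
run = BenchmarkRun(
    hardware="RTX 4090",
    engine="llama.cpp",
    model="some-7b-model",
    quantization="Q4_K_M",
    kv_cache="f16",
    input_tokens=2048,
    output_tokens=512,
    generation_seconds=9.7,
)
print(json.dumps({**asdict(run), "tok/s": round(run.tokens_per_second, 1)}, indent=2))
```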