Please correct me if I'm wrong: are you running a batch size of 1 across 500 GPUs? If so, why are the responses almost instant? Also, when can we expect a bring-your-own-fine-tuned-model option? Thanks!
We are not using 500 GPUs; we are using a large system built from many of our own custom ASICs. This lets us run at batch size 1 with no reduction in overall throughput. (We do use pipelining, though, so many users share the same system at once.)
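A toy timing model can illustrate why pipelining preserves throughput at batch size 1. The stage count and timings below are made-up numbers for illustration only, not a description of the actual hardware: once the pipeline fills, each stage works on a different user's request, so a request completes roughly every stage-time even though each request is processed alone.

```python
# Toy model of pipeline parallelism at batch size 1.
# All numbers are hypothetical, chosen only to show the effect.
STAGES = 4        # number of pipeline stages (assumed)
STAGE_TIME = 1.0  # time per stage, arbitrary units
REQUESTS = 8      # independent user requests

# No pipelining: each request runs through all stages before the
# next one starts, so total time scales with STAGES * REQUESTS.
sequential_total = REQUESTS * STAGES * STAGE_TIME

# Pipelining: after a fill time of STAGES stage-times, one request
# finishes per stage-time, so many users share the system at once.
pipelined_total = (STAGES + REQUESTS - 1) * STAGE_TIME

print(sequential_total)  # 32.0
print(pipelined_total)   # 11.0
```

Per-request latency is unchanged (each request still spends `STAGES * STAGE_TIME` in the pipe), but aggregate throughput approaches one request per stage-time, which is how batch size 1 can coexist with high overall throughput.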