Thanks! The general answer is that it depends on your model and on which FPGA platform we're talking about, but in head-to-head benchmarks you'll find results in the ballpark of 2-10x CPU performance and 0.5-2x GPU performance. As you point out, power and cost are big differentiators. The other thing to consider (as another commenter mentioned) is that inference on CPU or GPU will usually require some model quantization or compression, which can degrade model accuracy. Tensil gives you a way around that dilemma, so you can have great performance without sacrificing accuracy.
Hi, I'm curious what you mean about model quantization being necessary on CPU and GPU? It isn't necessary by default - OpenVINO, TVM, and TensorRT can run single-precision inference on most classic models quite fast. If you're reaching for very low power or ultimate perf, yeah, you can downgrade to fp16 (well... mixed precision) with NVIDIA tensor cores or AVX512-FP16, or bf16 in some Intel VNNI configs. Going to integer will give you more throughput too, but it's not necessary. Even Myriad X is supposed to handle some kind of fp16 with the SHAVE cores.
The only time I had to reach for quantized (integer) networks to do anything at all was inferencing on FPGAs. Are you targeting DSP slices by default, or implementing full IEEE 754 floating point?
Are you saying that with Tensil you can run single-precision, non-quantized models at up to 2x GPU perf?
I probably misunderstood your last sentence, sorry.
Sorry if this was unclear - in a datacenter use case you're right, but for an edge deployment you'll usually need to quantize, prune, or compress your ML model to get it running as fast as you'd like on a sufficiently small CPU/GPU. Compared with running your model unchanged on those platforms, Tensil delivers the performance ranges listed above. You can also quantize and use Tensil too!
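For anyone following along, here's a minimal sketch of why quantization costs accuracy. This is symmetric int8 post-training quantization boiled down to plain Python - real toolchains (TensorRT, OpenVINO, etc.) do this per-channel with calibration data, so the function names and the single global scale here are illustrative assumptions, not any particular library's API:

```python
def quantize(weights, num_bits=8):
    """Map float weights to signed integers using one global scale factor."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer representation."""
    return [x * scale for x in q]

weights = [0.42, -1.30, 0.07, 0.98]
q, scale = quantize(weights)
recovered = dequantize(q, scale)

# Rounding loses up to scale/2 per weight - that per-weight error is the
# source of the accuracy degradation discussed above, and it grows as you
# drop to fewer bits.
errors = [abs(w - r) for w, r in zip(weights, recovered)]
print(max(errors))
```

The rounding error is bounded by half the scale factor, which is why 8-bit quantization is often tolerable while 4-bit or lower usually needs retraining or calibration to claw the accuracy back.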
Will do - as I mentioned in another comment, it can be a bit subtle to find an apples-to-apples comparison, but we'll soon add some cross-platform benchmarks that we think are reasonable.