Yeah, where are the bfloat16 numbers for the Neural Engine? For AMD you can at least divide by four to get the real number: 16 TOPS -> 4 TFLOPS within a mobile power envelope is pretty good for assisting CPU-only inference on device. Not so good if you want to run an inference server, but that wasn't the goal in the first place.
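Rough sketch of what that divide-by-four rule of thumb assumes: the quoted TOPS figure is INT8 with structured sparsity, and dense bf16 runs at half the dense INT8 rate. Both factors are my guesses at what the rule encodes, not anything AMD publishes:

    def est_bf16_tflops(int8_tops, precision_factor=2.0, sparsity_factor=2.0):
        # precision_factor: assumed INT8-vs-bf16 throughput ratio (guess: 2x)
        # sparsity_factor: assumed boost from sparsity baked into the quoted
        # TOPS number (guess: 2x); together they give the /4 rule of thumb
        return int8_tops / (precision_factor * sparsity_factor)

    print(est_bf16_tflops(16))  # 4.0 -> the "16 TOPS -> 4 TFLOPS" estimate above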
What irritates me the most, though, is people comparing a mobile accelerator with an extreme high-end desktop GPU. Some models only run on a dual-GPU stack of those. Smaller GPUs are not worth the money; NPUs are primarily eating the lunch of low-end GPUs.
RTX 4090: 1,321 Tensor TOPS according to the spec sheet, so roughly 35x.
RTX 4090 is 191 Tensor TFLOPS vs the M2's 5.6 TFLOPS, so roughly 34x (M3 specs are hard to find).
RTX 4090 is also 1.5 years old.