A decent chunk of AI performance comes down to doing matrix multiplication fast. Part of that is reducing the amount of data transferred to and from the matrix-multiplication hardware on the NPU or GPU; memory bandwidth is a significant bottleneck. The article is highlighting the use of 4-bit formats for exactly that reason.
GPUs are an evolving target. New GPUs have tensor cores and support all kinds of interesting numeric formats; older GPUs don't support any of the formats that AI workloads use today (e.g. BF16, int4, and the various smaller FP types).
An NPU will be more efficient because it is much less general than a GPU and doesn't spend any gates on graphics. However, it is also fairly restricted. Cloud hardware is orders of magnitude faster (due to much higher compute resources and I/O bandwidth), e.g. https://cloud.google.com/tpu/docs/v6e.
Agree on NPU vs CPU memory bandwidth, but not sure about characterizing the GPU that way. GDDR is usually faster than DDR of the same generation, and on higher-end graphics cards it has a wider bus. A few GPUs have HBM, as do pretty much all datacenter ML accelerators (NVidia B200 / H100 / A100, Google TPU, etc.). The PCIe bus between host memory and GPU memory is a bottleneck for intensive workloads.
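To put very rough numbers on that last point (the figures below are ballpark assumptions, not measurements of any real card): even a small quantized model takes far longer to cross the PCIe bus than to re-read from on-card memory, which is why you want the weights resident in VRAM rather than streamed from the host.

    # Ballpark comparison of host<->GPU bus bandwidth vs on-card bandwidth.
    # All figures are rough assumptions for illustration only.
    model_bytes = 7e9 * 0.5      # ~7B parameters at 4 bits each -> ~3.5 GB
    pcie4_x16 = 32e9             # ~32 GB/s over a PCIe 4.0 x16 link (assumed)
    on_card = 1000e9             # ~1 TB/s GDDR/HBM-class bandwidth (assumed)

    print(f"over PCIe: {model_bytes / pcie4_x16:.2f} s to move the weights once")
    print(f"from VRAM: {model_bytes / on_card * 1000:.1f} ms to read the weights once")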
To perform a multiplication on a CPU, even with SIMD, the values have to be fetched and converted to a form the CPU has multipliers for. This means smaller numeric types are penalised. For a 128-bit memory bus, an NPU can fetch 32 4-bit values per transfer; the best case for a CPU is 16 8-bit values.
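To make that concrete, here is a rough Python sketch of the packing being described: two signed 4-bit weights stored per byte, which a CPU has to unpack and widen before its 8/16-bit integer multipliers can use them. The layout and sign convention are assumptions for illustration, not the actual format any particular NPU uses.

    import numpy as np

    def pack_int4(values):
        """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
        v = (values.astype(np.int8) & 0x0F).astype(np.uint8)
        return v[0::2] | (v[1::2] << 4)

    def unpack_int4(packed):
        """Unpack to int8 so the CPU's wider multipliers can operate on the values."""
        lo = (packed & 0x0F).astype(np.int8)
        hi = ((packed >> 4) & 0x0F).astype(np.int8)
        lo = np.where(lo > 7, lo - 16, lo)   # sign-extend the 4-bit values
        hi = np.where(hi > 7, hi - 16, hi)
        out = np.empty(packed.size * 2, dtype=np.int8)
        out[0::2], out[1::2] = lo, hi
        return out

    weights = np.array([3, -2, 7, -8, 1, 0, -1, 5], dtype=np.int8)
    packed = pack_int4(weights)                     # 4 bytes instead of 8
    assert np.array_equal(unpack_int4(packed), weights)

Dedicated 4-bit hardware avoids the unpack/widen step because its multipliers take the narrow operands directly.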
Details are scant on Microsoft's NPU, but it probably has many parallel multipliers, either in the form of tensor cores or a systolic array. The effective number of matmuls per second (or per memory operation) is therefore higher.
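Since the actual microarchitecture isn't public, here's only a toy illustration of the general idea: an output-stationary grid of multiply-accumulate units, where every element of the result has its own accumulator and each loop iteration corresponds to one parallel wave of MACs across the whole array. The array size and dataflow are made up for the example.

    import numpy as np

    def systolic_style_matmul(A, B):
        """Toy output-stationary dataflow: one accumulator per output element;
        each iteration of k is conceptually a single parallel step of the array."""
        M, K = A.shape
        K2, N = B.shape
        assert K == K2
        acc = np.zeros((M, N), dtype=np.int32)      # one accumulator per PE
        for k in range(K):                          # K "cycles" of the array
            acc += np.outer(A[:, k].astype(np.int32), B[k, :])
        return acc

    A = np.random.randint(-8, 8, size=(4, 8)).astype(np.int8)   # int4-range values
    B = np.random.randint(-8, 8, size=(8, 4)).astype(np.int8)
    assert np.array_equal(systolic_style_matmul(A, B),
                          A.astype(np.int32) @ B.astype(np.int32))

In software this is just a matmul, but in hardware all M*N multiply-accumulates inside the loop body happen at once, which is where the matmuls-per-memory-operation advantage comes from.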
Yeah standalone GPUs do indeed have more bandwidth, but most of these Copilot PCs that have NPUs just have shared memory for everything I think.
Fetching 16 8-bit values vs 32 4-bit values is the same amount of data; that's the form they are stored in memory. Doing some unpacking into more registers and back is more or less free anyway if you are memory-bandwidth bound. On these lower-end machines everything is largely memory bound, not compute bound, although in some systems (e.g. the Macs) the CPU often can't use the full memory bandwidth while the GPU can.
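A quick back-of-the-envelope version of the "memory bound, not compute bound" point, for a single matrix-vector product during token generation. The bandwidth and throughput numbers are made-up but plausible figures for a laptop-class part, not measurements of any real machine.

    # Rough roofline check for one matrix-vector product during decode.
    # All hardware numbers below are illustrative assumptions, not specs.
    mem_bandwidth = 120e9     # bytes/s of shared LPDDR-class memory (assumed)
    macs_per_s = 10e12        # multiply-accumulates/s the unit can sustain (assumed)

    rows, cols = 4096, 4096   # one weight matrix of a transformer layer
    macs = rows * cols        # one MAC per weight for a matvec

    for name, bits in [("int8 weights", 8), ("int4 weights", 4)]:
        bytes_moved = rows * cols * bits / 8
        t_mem = bytes_moved / mem_bandwidth
        t_compute = macs / macs_per_s
        bound = "memory" if t_mem > t_compute else "compute"
        print(f"{name}: {bytes_moved / 1e6:.1f} MB moved, "
              f"mem {t_mem * 1e6:.0f} us vs compute {t_compute * 1e6:.0f} us -> {bound} bound")

With numbers in that ballpark the memory term dwarfs the compute term, so halving the weight width roughly halves the runtime and the cost of unpacking disappears into the compute column.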
Yes, agree. Probably the main thing is that the NPU is a dedicated unit without the generality/complexity of a CPU, and so it's able to crunch matmuls more efficiently.
Unlike smart watches, where integrating the watch and phone probably still has lots of room for innovation, I agree there isn't much value in proprietary printer inks; standardization probably has more consumer benefit.
Yeah, I suspect they know they can never block everything, but if they can block 98% of "casual users" they've probably reached their goal. They'll just put out propaganda that the 2% of technically apt people who get around it are conspiracy nuts, Western civ sympathizers, traitors to Mother Russia, etc.
Counterpoint: this YouTube rant by an animation person called Noodle is a pretty good overview of why frame interpolation sucks. https://www.youtube.com/watch?v=_KRb_qV9P4g
Basically, low FPS can be a stylistic choice, and making up new frames at playback time often completely butchers some of the nuances present in good animation.
If you are a good director, you can make the most of a low budget. Look at the first episodes of Scum's Wish (https://www.imdb.com/title/tt6197170/) if you want a good example.
Animation is the worst use case for motion interpolation because the frames are individually drawn and timed by the animators to achieve a particular look and feel.