Agreed. This isn’t actually that useful a guide in the first place.
Tbh the most basic question is: “are you innovating inside the AI box or outside the AI box?”
If inside - this guide doesn’t really share anything practical. If you’re going to be tinkering with a core algorithm and trying to optimize it, it makes more sense to understand BLAS and cuBLAS (or the AMD / Apple / Google equivalents), to understand what pandas, torch, numpy, and a variety of other tools are doing for you, and then to wield them effectively.
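As a concrete example of what those libraries do for you, here’s a minimal sketch (timings will vary by machine and BLAS build) of the same dot product written as a Python loop and as a numpy call that dispatches to BLAS underneath:

```python
import time
import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# Naive Python loop: every element goes through the interpreter.
t0 = time.perf_counter()
acc = 0.0
for x, y in zip(a, b):
    acc += x * y
t_loop = time.perf_counter() - t0

# a @ b dispatches to a BLAS dot routine: one C call, vectorized.
t0 = time.perf_counter()
acc_blas = a @ b
t_blas = time.perf_counter() - t0

print(f"loop: {t_loop:.3f}s   blas: {t_blas:.5f}s")
```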
If outside the box - understanding how to spot the signs of inefficient resource use - whether that’s network, storage, accelerator, CPU, or memory - and then reasoning through how to reduce that bottleneck.
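Spotting the signs starts with measuring rather than guessing. A minimal sketch using only the stdlib profiler; the pipeline stages here are stand-ins, not anyone’s real workload:

```python
import cProfile
import io
import pstats

# Stand-in stages for a real pipeline.
def load():
    return list(range(1_000_000))

def transform(xs):
    return [x * 2 for x in xs]

def pipeline():
    return sum(transform(load()))

prof = cProfile.Profile()
prof.enable()
pipeline()
prof.disable()

out = io.StringIO()
pstats.Stats(prof, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())  # shows which stage dominates wall time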
Like - I’m certain we’ll see this covered in the near future, but off the top of my head, these are the innocent but incorrect things people do:
1. Sending single requests instead of batching (sketch after the list, together with #2)
2. Using a synchronous programming model when asynchronous is probably better
3. Sending data across a compute boundary unnecessarily (sketch after the list, together with #4)
4. Sending too much data
5. Assuming all accelerators are the same. That T4 GPU is cheaper than an H100 for a reason.
6. Ignoring bandwidth limitations
7. Ignoring access patterns (sketch after the list, together with #6)
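A rough sketch of #1 and #2, with a simulated remote call (the ~50 ms latency and the `call_model` function are made up for illustration). The point is that per-request latency, not compute, dominates when you send items one at a time synchronously:

```python
import asyncio
import time

# Hypothetical remote call: ~50 ms round trip regardless of payload size,
# so per-item cost is dominated by the trip, not the work.
async def call_model(batch):
    await asyncio.sleep(0.05)          # network + queueing latency (simulated)
    return [len(x) for x in batch]     # stand-in for real inference

async def one_at_a_time(items):
    return [(await call_model([x]))[0] for x in items]

async def batched(items, batch_size=32):
    out = []
    for i in range(0, len(items), batch_size):
        out.extend(await call_model(items[i:i + batch_size]))
    return out

async def concurrent(items):
    # Async lets independent single requests overlap in flight.
    results = await asyncio.gather(*(call_model([x]) for x in items))
    return [r[0] for r in results]

items = ["example"] * 64
for fn in (one_at_a_time, batched, concurrent):
    t0 = time.perf_counter()
    asyncio.run(fn(items))
    print(f"{fn.__name__}: {time.perf_counter() - t0:.2f}s")
```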
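For #3 and #4, a torch sketch (falls back to CPU if no GPU is present, in which case the copies are cheap and the example is only illustrative):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)
w = torch.randn(4096, 4096, device=device)

# 3. Crossing the compute boundary unnecessarily: this bounces a ~64 MB
# tensor through host memory and forces a device sync, for nothing.
y = (x @ w).cpu().to(device)   # needless host round trip
y = torch.tanh(y @ w)          # ...then keeps computing on the device anyway

# 4. Sending too much data: ship only what you need across the boundary.
bad = (x @ w).cpu().numpy().mean()   # copies the whole tensor to host
good = (x @ w).mean().item()         # reduces on device, copies one scalar
```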
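And for #6 and #7, a numpy sketch of how reading the same number of elements with a different stride costs very different memory traffic:

```python
import time
import numpy as np

a = np.zeros((10_000, 10_000), dtype=np.float32)   # C order: rows contiguous

def bench(label, view, reps=100):
    t0 = time.perf_counter()
    for _ in range(reps):
        view.sum()
    print(f"{label}: {time.perf_counter() - t0:.3f}s")

bench("row (unit stride)  ", a[0, :])   # one cache line serves 16 floats
bench("col (40 KB stride) ", a[:, 0])   # a fresh cache line per element
```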