Do you actually run the inference in the worker? Or is it like what Fermyon does, where they basically host the models for you and you get an SDK that is automatically connected to the function?
It's a little like how Cloudflare Workers runs. You don't know which CPU it runs on; all you know is that it's a CPU close to your end user. Same goes for this. We are rolling out GPUs everywhere across the globe, so Workers AI will just use a nearby GPU: probably on the same machine as your Worker, or maybe in the same data center, or wherever else our smart routing decides. What we are not doing is running a massive GPU cluster somewhere. This is all distributed, and that's the power of owning your own network.
I think the confusion is over what is meant by "in the Worker." From a hardware perspective, the GPU may be in the same machine as the CPU that's powering the Worker, or they may be on different machines in our network. We are not routing requests to some third party, and we will try to run the inference task as close as possible to whoever (or whatever) requested it. The whole idea of "serverless" is that you shouldn't have to worry about which machine runs what, unless you're on the team building the scheduling and routing logic at Cloudflare.
I think his question is more about whether the Worker directly accesses the GPU and thus requires JS tooling to handle the GPU somehow (no), or whether it makes subrequests to a separate GPU service that isn't running the Worker runtime (yes).
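From the Worker's point of view it ends up looking roughly like the sketch below (TypeScript, assuming an `env.AI` binding with a `run()` method and an illustrative model name; not necessarily the exact API surface being announced). The Worker never touches a GPU itself; it just awaits the result of the inference call, and the routing to a nearby GPU happens inside Cloudflare's network:

    // Sketch of a Worker calling Workers AI via an assumed `AI` binding
    // configured in wrangler.toml (types from @cloudflare/workers-types).
    export interface Env {
      AI: Ai;
    }

    export default {
      async fetch(request: Request, env: Env): Promise<Response> {
        // Model name is illustrative; any Workers AI text model would work the same way.
        const result = await env.AI.run("@cf/meta/llama-2-7b-chat-int8", {
          prompt: "Where does this inference actually run?",
        });

        // From the Worker's perspective this is just an awaited call to a separate
        // GPU-backed service; no GPU handling happens in the Worker runtime itself.
        return Response.json(result);
      },
    };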