
TL;DR: GPUs all over the Cloudflare global network; working closely with Microsoft, Meta, Hugging Face, Databricks, NVIDIA; new Cloudflare-native vector database; inference embedded in Cloudflare Workers; native support for WebGPU. Live demo: https://ai.cloudflare.com/



Do you actually run the inference in the worker? Or is it like what Fermyon does, where they basically host the models for you and you get an SDK that is automatically connected to the function?


Unlike the first version of Constellation, Workers AI runs inference directly on GPUs that we are (quickly) installing in our global network.


But the code isn't running on the worker? It runs somewhere else on a GPU cluster?


It's a little like how Cloudflare Workers runs: you don't know which CPU your code runs on; all you know is that it's a CPU close to your end user. The same goes for this. We are rolling out GPUs everywhere across the globe, so Workers AI will just use a nearby GPU: probably on the same machine as your Worker, maybe in the same data center, or wherever else our routing logic decides is smart. What we are not doing is running one massive GPU cluster somewhere. This is all distributed, and that's the power of owning your own network.


Since they don’t seem to be able to give a simple answer: the inference does not run in the worker. It connects to external GPUs.


I think the confusion is over what's meant by "in the Worker." From a hardware perspective, the GPU may be in the same machine as the CPU that's powering the Worker, or they may be on different machines in our network. We are not routing requests to some third party, and we will try to run the inference task as close as possible to whoever (or whatever) requested it. The whole idea of "serverless" is that you shouldn't have to worry about which machine runs what, unless you're on the team building the scheduling and routing logic at Cloudflare.


I think his question is more about whether the Worker directly accesses the GPU, and thus needs JS tooling to drive the GPU somehow (no), or whether it makes subrequests to a separate GPU service that isn't running the Worker runtime (yes).
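
For the curious, here's a minimal sketch of what that looks like from the Worker's side, based on the launch announcement (the AI binding name and the model identifier come from that post, so treat the details as illustrative):

    // Worker that calls Workers AI through a binding.
    // The binding is declared in wrangler.toml as:
    //   [ai]
    //   binding = "AI"
    import { Ai } from '@cloudflare/ai';

    export default {
      async fetch(request, env) {
        // env.AI is the binding; ai.run() issues the inference request,
        // which Cloudflare routes to a nearby GPU. The Worker code itself
        // keeps running on a CPU in the Workers runtime.
        const ai = new Ai(env.AI);
        const output = await ai.run('@cf/meta/llama-2-7b-chat-int8', {
          prompt: 'Where does the phrase "Hello, World" come from?',
        });
        return new Response(JSON.stringify(output));
      },
    };

So from the developer's point of view it's just an await on a binding, the same way you'd use KV or R2; the routing to a GPU is the platform's problem.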


Hey John, great work on this! Just a heads-up, small typo on that page under R2: "Build mutli-cloud training architectures with free egress."


Thanks. Getting it fixed.


Any chance you're looking for technical product folks to work on this? I actually worked on a very similar deployment internally at Livepeer (focus was on live video enhancements but also generalized edge compute)!


we always are! email is rita at cloudflare dot com :)


thanks!


I see plans for more models via the HF partnership, but can I (or will I be able to) run a custom fine-tuned version of a supported model?


On top of our hosted and supported catalog of models, and deploy-to-CF partnerships like the one with HF, you will also be able to bring your own custom model at some point.


Awesome. What about compiled model support? Running most of the listed models without compilation only makes sense for hobby projects.


Is CodeLlama somewhere on the roadmap?



