Workers AI: Serverless GPU-powered inference (cloudflare.com)
261 points by jgrahamc on Sept 27, 2023 | 114 comments



I tried serverless for whisper on an existing competitive service.

Cold boot plus inference on an 8-word STT clip took 45 seconds, and warm runs never got under 15 seconds.

This does not work for STT, which needs a much faster turnaround.

Can anyone give feedback on whether Whisper, at any of its model sizes, can work well on serverless?

Do most AI serverless solutions suffer from significant cold boot delays?

The cheapest persistent GPU cloud instance I saw on G was ~$160 a month. Is that roughly the kind of money people need to be prepared to spend to have a model ready to go at all times as a service to another product?


Whisper large is only 1.5B params; why not run it client-side with something like https://github.com/FL33TW00D/whisper-turbo

(Disclaimer: I am the author)


Seems like WebGPU is not supported by mobile Safari yet.

And it possibly has coverage in only 65% of the desktop browser market. [1] Does that roughly match how you understand the current penetration of this browser API?

Presuming coverage for a given user, I don't have a good answer for why you'd consider a remote instance.

It seems like it would be worth testing for WebGPU support and attempting to run on the client if possible, but then having a remote instance available otherwise (rough sketch below).

Does that make sense to you?

Can you tell me another reason why someone would want a remote instance of Whisper, given the 20x realtime potential on the client in your project?

[1] https://caniuse.com/webgpu
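
A minimal sketch of that detection-plus-fallback idea, assuming the standard navigator.gpu entry point; the backend names are placeholders:

    // Hypothetical helper: prefer client-side WebGPU, fall back to a remote API.
    // navigator.gpu is the standard WebGPU entry point; everything else is illustrative.
    async function pickTranscriptionBackend(): Promise<"client" | "remote"> {
      if (!("gpu" in navigator)) return "remote"; // browser ships no WebGPU at all
      try {
        // requestAdapter() can still resolve to null (blocklisted driver, headless, etc.)
        const adapter = await (navigator as any).gpu.requestAdapter();
        return adapter ? "client" : "remote";
      } catch {
        return "remote";
      }
    }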


Indeed, WebGPU is basically only supported on Chromium-based browsers.

This means that the primary use case for whisper-turbo and my upcoming libraries is Electron/Tauri apps. For users who don't have WebGPU support for whatever reason, we will still hit OAI/another server deployment. In the ideal case there should be a 90% cost reduction with the same or improved UX.

Someone will still want a remote instance today, as there is still engineering to be done. I need more aggressive quantization, a better developer experience, and more features to get people off the OAI API.


Got it.

I see WebGPU is available in Safari Tech Preview 92, but it's still an "experimental" feature there.

It looks like WebGPU has been around the block for a while now. I wonder what the holdup is at Firefox and Safari. It would be much preferable to run more ops on the client. (Complete speculation, but I could see this, and the battery-use implications, possibly making Apple hesitant.)

My goal is to provide a browser-based experience first, to reach the largest potential user base with no install friction. So, at least for now, an Electron app is not in the plan.


WebGPU is likely multiple years out for Safari; they've had and removed implementations before.

Firefox has a working implementation, but it lags behind the Chromium one.

Browser-based still works great! Check out the whisper-turbo demo: https://whisper-turbo.com/


Maybe the client is another backend service or serverless function, i.e. one where they'd need to pay for the GPU anyway.


Serverless only works if the cold boot is fast. For context, my company runs a serverless cloud GPU product called https://beam.cloud, which we've optimized for fast cold starts. We see Whisper cold-start in production in under 10s (across model sizes). A lot of our users are running semi-real-time STT, and this seems to be working well for them.


>...this seems to be working well for them.

Is this because the users are streaming audio in a more conversational style?

For example, when you give Siri a command, you state it and then stop speaking.

For most of ChatGPT's life, in OpenAI's iOS app, if you wanted to speak to input text, you would tap the record button and then tap it off, using either the app's own speech-to-text capability or Siri's input-field speech-to-text.

Conversational speech-to-text is more ongoing, though, which would make a 10-second cold start OK: you don't sense as much lag because you keep speaking.

Or perhaps people generally record input longer than 10 seconds, and you are sending the first chunk as soon as possible to get Whisper going.

Then follow-up chunks are handled as warm boots, and the text is reassembled? Is that roughly correct?

Anything you can share about the request and data flow that works with a longer cold boot time, in the context of a single recording versus streaming, and how the audio is broken up, would be helpful. (A rough sketch of the flow I'm imagining is below.)
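
Purely to illustrate that hypothesized flow; the endpoint and the { text } response shape are made up for illustration:

    // Hypothetical client sketch: POST each chunk as soon as it's recorded, so the
    // first request's cold start overlaps with the user still speaking, and later
    // chunks land on an already-warm instance.
    async function transcribeChunks(
      chunks: AsyncIterable<Blob>,
      endpoint: string,
    ): Promise<string> {
      const pending: Promise<string>[] = [];
      for await (const chunk of chunks) {
        pending.push(
          fetch(endpoint, { method: "POST", body: chunk })
            .then((res) => res.json())
            .then((out: { text: string }) => out.text),
        );
      }
      // Reassemble the per-chunk transcripts in recording order.
      return (await Promise.all(pending)).join(" ");
    }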


Cloudflare has almost no cold boot, and I think their ML models are prefetched within the same DC, so loading a model the first time should have no noticeable overhead either.

Correct me if I'm wrong, but that's how I interpreted it when they first started with ML models on their CPUs.


The models in our catalog are all pre-loaded before requests come in.


Makes sense. A bring-your-own-model feature would be awesome, but that would make pre-loading impossible for rarely used models without tying up valuable GPU RAM.


(STT: Speech to Text)


Which service did you look into?


It is a startup and happened in the onboarding demo, I'll avoid naming them in case they sort it out or whatever. If you really want to know, please send me an email.


This is very cool. I'm still trying to understand the pricing. What is a "neuron" in this context? A token? A character?

"Neurons are a way to measure AI output that always scales down to zero (if you get no usage, you will be charged for 0 neurons). To give you a sense of what you can accomplish with a thousand neurons, you can: generate 130 LLM responses, 830 image classifications, or 1,250 embeddings."

130 LLM responses of what length? 1,250 embeddings of what size of text?


It's effectively a unit of time benchmarked to what we can accomplish in that time as of Sept 27, 2023 (launch). The challenge here is that because we're abstracting away the underlying hardware it's not the same as renting a VM for a period of time. We also don't want to create perverse incentives that keep us from making the underlying system faster. It's similar to how AWS standardized EC2 to a standard compute unit. Over time, as we continue to add faster and faster hardware and better optimize models we expect the cost of a neuron will trend down but the amount of AI inference work that you can do with a neuron will remain relatively constant.


Then call it something like Neural Time Unit (NTU) or Computational Time Unit (CTU) because neurons make people think of neural networks. As in, you pay for the size of your model.


Could you give us an example of what that means in practical terms?

The post says that 1000 neurons will give you 130 LLM responses - but of what length?

(LLMs are generally priced by input and output tokens. The more tokens, the longer the compute time. Without an idea of what you mean by a response, it's hard to understand.)

Likewise: 1,250 embeddings – how big is the text in that example?

I'm VERY excited to see you doing this and understand it's early stages, but I can't wrap my head around the pricing without context.


Please rename it, or at least make sure it corresponds to actual neural operations. It's terribly confusing for practitioners.


Really amazing stuff to see this launch with Hugging Face! Hope to see it expand beyond text too.

“Neuron” is a cute name, but there’s too much conceptual overlap with floating-point ops, layers, model parameters, etc., which are time-independent. Should just call them inference credits or something. When a large model runs on multiple GPUs, it’s even more confusing what neurons or dollars per second might mean.


Sounds like 1 Neuron ~= X FLOPS



Those don't explain the relation between neuron cost and length.


Those max tokens seem pretty low


Interesting how the exact same blog URL was used previously, in April 2021.

https://news.ycombinator.com/item?id=26795517


That was our early alpha cooperation with NVIDIA. We've learned a lot since then. Not to mention, the AI ecosystem has grown up a bunch. But you are correct: this is something we've been planning for for a loooooooong time.


I'd love to see a blog post about the knowledge delta and growth progress. That's more interesting to me than the actual announcements.


I tried to spin up a free plan and run the Whisper demo on a new Worker, and it immediately just gives me:

    Error 1102
    Worker exceeded resource limits
Did I mess up the config or is it just not intended to be tried out without being on a paid plan already?


It's 100% designed to let you try it out for free, so something else must be going on. Feel free to message me at pwittig at cloudflare dot com, and I'm happy to help debug.

Also, we're still figuring some things out, but current limits are here: https://developers.cloudflare.com/workers-ai/platform/limits...


I want to love Workers, but I've never had great luck with anything more than a basic CRUD app. Even getting an external DB to connect proved to be more work than it should have, and their docs are outdated and all over the place, often contradicting themselves.


The docs are soooooo bad. This is the same story with any exciting / new product.

Honestly, I'm beginning to think that we need some kind of documentation-first style development. Like TDD, but DDD....

I have integrated Facebook APIs, Instagram (well, same thing), Google APIs, Stripe APIs, Mailchimp APIs, etc. etc.

And the only thing common among all of them? The documentation is _terrible_..., like, _terrible_.

I also run a few products online and I spend so, so much time trying to get the documentation right. It's incredibly boring and tedious, but I really feel that if you want to set yourself apart from the big players, make good documentation. It can't be that hard.


Good documentation is key to PHP's success, along with a bunch of other things that get dismissed as "inferior technology".

The worst documentation is the jargon-filled, abstract, vibes-based kind where the authors basically typed it with one hand on the keyboard. It's like, "OK, you're amazing. Now how do I resolve this error, and what are your command-line flags?"


What do you think is missing in the docs?


I am having quite a few issues getting the reference API code at https://developers.cloudflare.com/workers-ai/models/llm/ to work:

    {'errors': [{'code': 'invalid_union',
                 'unionErrors': [{'issues': [{'code': 'invalid_type', 'expected': 'object', 'received': 'string',
                                              'path': ['body'], 'message': 'Expected object, received string'}],
                                  'name': 'ZodError'},
                                 {'issues': [{'code': 'invalid_type', 'expected': 'object', 'received': 'string',
                                              'path': ['body'], 'message': 'Expected object, received string'}],
                                  'name': 'ZodError'}],
                 'path': ['body'], 'message': 'Invalid input'}],
     'success': False, 'result': {}}
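
For reference, the Zod error suggests the request body went out as a raw string rather than a JSON object. A minimal sketch of what I believe the documented REST call expects (account ID, token, and the model slug are placeholders taken from the launch docs and may differ):

    // Sketch of the REST call the linked docs describe; endpoint path, model slug,
    // and the { prompt } input shape follow the launch docs and may change.
    const ACCOUNT_ID = "<account-id>"; // placeholder
    const API_TOKEN = "<api-token>";   // placeholder

    const resp = await fetch(
      `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/@cf/meta/llama-2-7b-chat-int8`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${API_TOKEN}`,
          "Content-Type": "application/json",
        },
        // Sending the prompt as a bare string is what produces
        // "Expected object, received string"; it must be a JSON object.
        body: JSON.stringify({ prompt: "Tell me a joke about Cloudflare" }),
      },
    );
    console.log(await resp.json());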


Sorry for the trouble. The docs have since been updated, so you should try again.

Also happy to help if you run into any other issues - pwittig at cloudflare dot com.


TL;DR: GPUs all over the Cloudflare global network; working closely with Microsoft, Meta, Hugging Face, Databricks, NVIDIA; new Cloudflare-native vector database; inference embedded in Cloudflare Workers; native support for WebGPU. Live demo: https://ai.cloudflare.com/


Do you actually run the inference in the worker? Or is it like what Fermyon does, where they basically host the models for you and you get an SDK that is automatically connected to the function?


Unlike the first version of Constellation, Workers AI runs inference directly on GPUs that we are (quickly) installing in our global network.


But the code isn't running on the worker? It runs somewhere else on a GPU cluster?


It's a little like how Cloudflare Workers runs: you don't know which CPU it runs on; all you know is that it's a CPU close to your end user. Same goes for this. We are rolling out GPUs everywhere across the globe, so Workers AI will just use a nearby GPU - probably in the same machine as your Workers, or maybe the same data center, or whatever other smart routing decision we make. What we are not doing is running a massive GPU cluster somewhere. This is all distributed, and that's the power of owning your own network.


Since they don’t seem to be able to give a simple answer: the inference does not run in the worker. It connects to external GPUs.


I think the confusion is what is meant by "in the Worker." From a hardware perspective, the GPU may be in the same machine as the CPU that's powering the Worker. Or they may be across different machines in our network. We are not routing requests to some third party. And we will try to run the inference task as close as possible to who/whatever requested it. The whole idea of "serverless" is you shouldn't have to worry about what machine where runs whatever unless you're on the team building the scheduling and routing logic at Cloudflare.


I think his question is more about whether the Worker directly accesses the GPU and thus requires JS tooling to handle the GPU somehow (no), or whether it makes subrequests to a separate GPU service not running the Worker runtime (yes).


Hey John, great work on this! Just a heads-up: small typo on that page under R2: "Build mutli-cloud training architectures with free egress."


Thanks. Getting it fixed.


Any chance you're looking for technical product folks to work on this? I actually worked on a very similar deployment internally at Livepeer (focus was on live video enhancements but also generalized edge compute)!


we always are! email is rita at cloudflare dot com :)


thanks!


I see plans for more models via HF partnership, but can I or will I be able to run a custom fine-tuned version of a supported model?


On top of our hosted and supported catalog of models, and the deploy to CF partnerships like the HF one, you will also be able to bring your own custom model at some point in time.


Awesome. What about compiled model support? Running most of the listed models without compilation only makes sense for hobby projects.


Is CodeLlama somewhere on the roadmap?


Very cool and also very simple as I’d expect from Cloudflare.

But I have a question - why not make inference as easy as the translation? Why do I have to run that in a worker rather than just as a simple API call? That would be much simpler.

Is there a technical reason or is it that people would want to have logic before making the call to llama?


You can do both! All of our models are supported both via the Workers/Pages binding (which makes it really easy to host the rest of the logic) and via the REST API.

docs: https://developers.cloudflare.com/workers-ai/get-started/res...

(llama specific example here too under curl: https://developers.cloudflare.com/workers-ai/models/llm/ )
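
For readers skimming the thread, the binding route from those get-started docs looks roughly like the sketch below (based on the launch-era examples; the @cloudflare/ai package name and model slug are as documented at launch and may change):

    // Rough sketch of a Worker using the AI binding, per the launch-era docs.
    import { Ai } from "@cloudflare/ai";

    export interface Env {
      AI: any; // the [ai] binding configured in wrangler.toml
    }

    export default {
      async fetch(request: Request, env: Env): Promise<Response> {
        const ai = new Ai(env.AI);
        // Run the hosted Llama 2 7B (int8) model with a simple prompt.
        const output = await ai.run("@cf/meta/llama-2-7b-chat-int8", {
          prompt: "What is the origin of the phrase 'Hello, World'?",
        });
        return Response.json(output);
      },
    };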


Awesome! Thanks for that. Cloudflare smashing it on simplicity as per usual.


@jgrahamc

Very curious if you can elaborate on Vectorize. More than the edge GPUs, entering the vector DB marketplace with a CF-proprietary integration is interesting (and a bit scary) to build on.

- Will Vectorize ever get OSS'd?

- If you want to migrate in either direction from some other vector DB (Milvus, Weaviate, Qdrant, Pinecone, etc.), what should you expect in terms of level of effort and features?

- What inherent advantages (latency? features?) would you get exclusively from Vectorize?


This sounds really cool. I already love cloudflare because of how easy they make it to compete with bigger companies for indie devs like myself.

Their pricing for their products always seems so much more affordable than something like AWS or GCP. I'm using R2 for storage for a client project, and from my calculations we would have to pay almost 3 times more if we hosted the files with AWS on S3.

I really hope they keep adding all the latest open-source AI models to their platform. If their pricing is as cheap as they state in this blog post, then I would rather use this service than install models on my computer. To get good inference speed with open-source models on my PC right now, I have to let usage spike up to 100%...


This could become something useful in the future, but right now it appears to be toy models only. I'm assuming Cloudflare will add useful models eventually, but the cold start times are going to be horrible on those. I'm struggling to think of useful applications for this. Maybe one day.


Which models would you find useful?


Depends on the task, but generally, LLaMA 2 finetunes of at least 13B params, 4-bit quantized


Did you see we have Llama 2 7B int8?


Can you provide an example of a task where the 7b model is useful?


It will be interesting to see if they can undercut OpenAI themselves on the cost for running Whisper in the cloud.


Are OpenAI still limiting Whisper to 50 requests per minute (and gpt-3.5-turbo to 3/min)? If so, then they don't need to undercut OpenAI, just provide unlimited requests. It's nearly impossible to provide user-facing AI solutions to customers due to these limits, and it's a very bad user experience (and a security risk) to force users to provide their own OpenAI API keys.

https://replicate.com/ is much better, with an average of 10 requests per second, but I would still very much prefer unlimited (where they can monitor high-volume users to check the traffic is legitimate), or a new pricing model where e.g. 0-10k rpm is at $0.001/sec, 10-100k rpm at $0.010/sec, and 100k+ rpm at $0.100/sec (pricing would of course need to be fine-tuned; just a quick example).



AWS too -> https://aws.amazon.com/blogs/aws/amazon-bedrock-is-now-gener...

> (Coming Soon) The Llama 2 13B and 70B parameter models by Meta will soon be available via Amazon Bedrock’s fully managed API for inference and fine-tuning.


It would be super cool to see SD running on this! Hyped to play around with Llama, since I don't have access to a good GPU.


The biggest one missing is stable diffusion.


stay tuned!


Any chance you're looking for technical product folks to work on this? I actually led a very similar deployment internally at Livepeer (focus was on live video enhancements but also generalized edge compute)!


You can always email jgc@cloudflare.com and I'll route the resume to the right people.


thanks!


At this cost, no one will be running embeddings elsewhere :o

https://twitter.com/eastdakota/status/1707056412575023352?t=...


I've never played with Cloudflare Workers, but I thought they were implemented as JavaScript runtimes that form an edge computing network.

Are the models run in JavaScript/WebAssembly behind the scenes?


You can use JavaScript or Wasm to interface with the AI binding (think of it as the SDK), but the inference task itself runs natively on top of an ML runtime, and the models are loaded onto GPUs.


Given this is a hosted API rather than arbitrary hosting, why choose the word "serverless"? Do you plan to offer arbitrary hosting in the future?

(bias: am Banana CEO)


Embedding cost and model choice make this very compelling. I'm working on leveraging embeddings in https://github.com/discourse/discourse-ai, where they power related topics, semantic search, and tag and category recommendations, among other things.

A cheap offering like this can make it a lot more reasonable for self-hosters.
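
To make the embeddings use case concrete, here is a small sketch of scoring topic relatedness with the bge-base model from the catalog (the { text } input and { data } output shapes follow the launch docs and may differ):

    // Sketch: embed two topic excerpts via the AI binding and compare them with
    // cosine similarity. Input/output shapes are per the launch docs (assumption).
    import { Ai } from "@cloudflare/ai";

    async function relatedScore(env: { AI: any }, a: string, b: string): Promise<number> {
      const ai = new Ai(env.AI);
      const { data } = await ai.run("@cf/baai/bge-base-en-v1.5", { text: [a, b] });
      const [va, vb] = data as number[][];
      // Cosine similarity between the two embedding vectors.
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < va.length; i++) {
        dot += va[i] * vb[i];
        na += va[i] * va[i];
        nb += vb[i] * vb[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }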


This would be a game changer if they had something for image generation as well. Oh well, maybe it's coming soon, since the page says this is just a small preview. The best part for me personally is that it's available on all plans, and the pricing looks good too.

OT, but if Cloudflare fixes their false positives, showing random captchas when I'm trying to browse the net on a VPN, they will surely be one of my favorite companies.


Image generation is coming.


kinda cool, but rather limited if you can't use custom models.


You will be able to.


Looking forward to it!


What’s the cold boot time? Isn’t that the most important part?


How does Workers AI compare to Replicate.com?


There are a lot of similarities, but here are a few differences:

It's region-less, and runs your inference task on the Cloudflare network, near your end users. Though that's not entirely true yet - we'll be in 100 sites by EOY '23, and nearly everywhere by EOY '24.

It was built to work alongside our new vector database, Vectorize, out of the box.

It's accessible to all developers, regardless of where you deploy (via API), but we wanted to offer a seamless option for developers already building on Cloudflare - Workers, Pages, etc.


Thought they were gonna go with CPU-based inference via llama.cpp/ggml. This project should definitely be made into a programming book.


I think they quickly reached the limits of CPU inference, and that's why they decided to scale up with GPUs.

Because of how they operate, they can't just release in a single DC like other providers. It's a partial "all or nothing" scenario.


Unless I missed something, it seems they didn't share which models by parameter count they're gonna be offering, nor anything at all about cost.

Also, this strikes me as really weird phrasing:

> Llama 2 now available for global usage on Cloudflare’s serverless platform, providing privacy-first, local inference to all

"privacy-first & local inference" would be to run it locally, on your own hardware, isn't that exactly what should be referred by when using "local"? Or has the definitions of words completely gone out the window as of late?


This is just a press release, and not a great source of in-depth info. This blog post from yesterday goes into it: https://blog.cloudflare.com/workers-ai/

>Models you know and love

>We’re launching with a curated set of popular, open source models, that cover a wide range of inference tasks:

>Text generation (large language model): meta/llama-2-7b-chat-int8

>Automatic speech recognition (ASR): openai/whisper

>Translation: meta/m2m100-1.2

>Text classification: huggingface/distilbert-sst-2-int8

>Image classification: microsoft/resnet-50

>Embeddings: baai/bge-base-en-v1.5


I missed that blog post somehow; thanks for sharing. A bit disappointing that it's just the 7B model; it's a good starting point for fine-tuning another small model, but it really isn't useful on its own.

> This is just a press release, and not a great source of in-depth info

Not sure I'd call outright lying/getting the most basic points wrong "not a great source of in-depth info", like the "local-first" part.


> Or has the definitions of words completely gone out the window as of late?

I think it means local as in "nearby" - so if you're in the UK it isn't processed in us-east-1, for example. You can pick one local to you.


Yeah, I think so too, but I guess I'm more upset that they use "local" in a "remote but physically nearby" fashion when it usually refers to "this computer/device".


Yeah - I think it's "local" as in "localisation". It is a little overloaded.


At least you get privacy from Meta.


Considering this is a collaboration, don't you think the agreement between them has something about feeding data back to Meta?


I just think it's loading the models and revenue sharing.

It would be similar to their partnership with Hugging Face, I suppose.

What data would be useful to capture? Outside of a human feedback loop, I don't think there is an actual use-case for it.

Note: not certain


Without any of us actually knowing the details of the collaboration, all we can do is guess, I guess.

> What data would be useful to capture?

Pipe the prompts straight to Meta and I'm sure they'll be able to extract a ton of useful data.


What's the use-case without any human feedback, which I already mentioned?


I don't work at Facebook, so I'm not gonna spend my time doing their work for them or you.

But off the top of my head, classifying the data into various categories and mentions of Facebook/Meta could allow them to derive sentiment about the company based on geographical location, and know where to invest more in changing the sentiment.


Even if this were a valid suggestion (I don't think it is):

This would be incredibly hard to do without IPs.

Cloudflare won't forward them, and it can be set up as an API. Additionally, mobile IPs are reused across users.

That doesn't even sound remotely like a valid indicator for "investments", definitely not when you consider the cost of creating and updating such a model.


100% marketing speak.


Can you use this for _anything_ you like? Business uses, etc.? Not sure what the terms are for Llama. I thought they were restricted in some way?


LLaMa 2 is sadly not open source and not open science, despite what Facebook keeps on claiming:

https://blog.opensource.org/metas-llama-2-license-is-not-ope...

Thus you will have to read the license and judge whether it is compatible with your business or use cases.

One should give them credit for making it available, which is a lot better than plenty of others. However, actually open models are starting to appear, so perhaps we will soon see Facebook and the like making theirs open as well? Who knows.


How well-protected is the "edge" in edge computing? I can see Cloudflare has edge locations in many countries. Can entities with physical access to Cloudflare's edge machines get access to sensitive user data?


With Cloudflare's default settings, a malicious entity can intercept Cloudflare <-> backend connections invisibly to the end user, since the SSL certificates aren't validated. The end user can also be the victim of plain old HTTP MITM on Cloudflare's upstream networks, as happened in 2016: https://news.ycombinator.com/item?id=12091900

It's hard to take Cloudflare's commitment to security seriously when they still ship such terrible default settings.


What do you mean?

You can install certificates issued by Cloudflare, and then the only thing that can connect to your server is Cloudflare.

No one can intercept it then.

If you're talking about Flexible SSL: sure, you can use it purely as an HTTPS proxy for the SEO score of your blog. But securing it is not much effort.

If it's just for a static blog, I'm not sure what you'd be protecting, though.


If you have a valid HTTPS certificate for example.com and then add example.com to Cloudflare, your overall security decreases, because the path from the CF datacenter to your origin is now vulnerable to MITM: the default SSL setting is "Full", which doesn't check certificate validity.

To the less experienced sysadmin everything looks like it's working fine and users also don't notice any difference, which is why it's a terrible default.

Sure, you _can_ configure Cloudflare securely, but it should be secure out of the box. That, however, adds friction when the origin doesn't have a valid SSL certificate, which probably hurts someone's KPIs.



We were building AI https://efn.kr/#ai into https://RTCode.io and ...

Cloudflare drops this! Sweet! Now, we have BYOAI.

Our whole offering runs on their network https://RTEdge.net

Our playground lets you live-code user Workers that deploy to Cloudflare Workers for Platforms!

- https://sw.rt.ht/?io (in-browser)

- https://sw.rt.ht/ (region-Earth)


All well and good, but why is the webpage loading this obfuscated javascript file? https://archive.is/htQgN


CEO here: simply because we want to protect the core components that differentiate our services. Similar to, say, https://www.photopea.com/ and many others you can find if you look behind the scenes.

Once we raise funding and establish a strong market presence, we will revisit this decision and dedicate developer resources to sharing our in-house tech with the world more openly. This will take a full position. If you like what you see and want to work with us, send us an email at work@elefunc.com and we will get in touch once we have open positions!


Looks interesting, but as feedback, the above-the-fold message on your homepage is buzzword salad; it's not really clear what you're offering a potential customer here. "Elefunc is building the real-time web, to empower thought-speed creativity."

By way of comparison, I know exactly what I'm getting as a developer on Replit's homepage: "Make something great. Build software collaboratively with the power of AI, on any device, without spending a second on setup"


Thank you! We are just getting started, and you've offered the first valuable public feedback on the company landing page. I will make sure our above-the-fold becomes just as clear! If you have further suggestions, please send them to support@elefunc.com



