I tried running Whisper serverless on an existing, competitive service.
Cold boot plus inference on an 8-word STT request took 45 seconds, and warm runs never got under 15 seconds.
That does not work for STT, which needs a much faster turnaround.
Can anyone give feedback on whether Whisper, at any of its model sizes, can work well on serverless?
Do most AI serverless solutions suffer from significant cold boot delays?
The cheapest persistent GPU cloud instance I saw on G was ~$160 a month. Is that roughly what people need to be prepared to spend to have a model ready to go at all times, as a service to another product?
Seems like WebGPU is not supported by mobile Safari yet.
And it possibly has coverage of only ~65% of the desktop browser market. [1] Does that roughly match your understanding of the current penetration of this browser API?
Presuming a given user has coverage, I don't have a good answer for why to consider remote inference.
It seems like it would be worth testing for WebGPU support and attempting to run on the client when possible, with a remote instance available as a fallback.
Does that make sense to you?
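For concreteness, here's roughly the check I have in mind (just a sketch; transcribeLocally and REMOTE_STT_URL are made-up placeholders, not real APIs):

    // Sketch: prefer on-device Whisper via WebGPU, otherwise fall back to a remote API.
    declare function transcribeLocally(audio: Blob): Promise<string>; // hypothetical on-device path
    declare const REMOTE_STT_URL: string;                             // hypothetical remote endpoint

    async function transcribe(audio: Blob): Promise<string> {
      const gpu = (navigator as any).gpu;            // undefined where WebGPU isn't supported
      if (gpu) {
        const adapter = await gpu.requestAdapter();  // can still be null (blocklisted GPU, etc.)
        if (adapter) return transcribeLocally(audio);
      }
      // No usable WebGPU: ship the audio to a remote Whisper instance instead.
      const res = await fetch(REMOTE_STT_URL, { method: "POST", body: audio });
      const { text } = await res.json();
      return text;
    }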
Can you tell me another reason why someone would want a remote instance of Whisper, given the 20x realtime potential on the client in your project?
Indeed, WebGPU is basically only supported in Chromium-based browsers.
This means that the primary use case for whisper-turbo and my upcoming libraries is Electron/Tauri apps. For users that don't have WebGPU support for whatever reason, we will still hit OAI/other server deployments. In the ideal case there should be a 90% cost reduction and the same or improved UX.
Someone will still want a remote instance today as there is still engineering to be done. I need more aggressive quantization, better developer experience and more features in order to get people off of the OAI API.
I see WebGPU is available in Safari Technology Preview 92, but it's still an "experimental" feature there.
It looks like WebGPU has been around the block for a while now. I wonder what the holdup is at Firefox and Safari. It would be much preferable to run more ops on the client. (Complete speculation, but I could see this, and the battery-use implications, possibly making Apple hesitant.)
My goal is to provide a browser-based experience first, to reach the largest potential user base with no install friction. So, at least for now, an Electron app is not in the plan.
Serverless only works if the cold boot is fast. For context, my company runs a serverless cloud GPU product called https://beam.cloud, which we've optimized for fast cold start. We see Whisper in production cold start in under 10s (across model sizes). A lot of our users are running semi-real time STT, and this seems to be working well for them.
Is this because the users are streaming audio in a more conversational style?
For example, when you give Siri a command, you state it and then stop speaking.
For most of ChatGPT's life, in OpenAI's iOS app, if you wanted to speak to input text, you would tap the record button and then tap it off, using either the app's own speech-to-text capability or Siri's input-field speech-to-text.
Conversational speech-to-text is more ongoing, though, which would make a 10-second cold start OK, because you don't sense as much lag while you keep speaking.
Or perhaps people in general record input longer than 10 seconds, and you are sending the first chunk as soon as possible to get Whisper going.
Then follow-up chunks are handled as warm runs, and the text is reassembled? Is that roughly correct?
Anything you can provide on the request and data flow that works with a longer cold boot, in the context of a single recording versus streaming, and on how the audio is broken up, would be helpful.
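To make my mental model concrete, here is roughly the flow I'm imagining (pure speculation on my part; STT_ENDPOINT and the chunking details are made up):

    // Speculative sketch of the chunked flow, not anyone's real API.
    declare const STT_ENDPOINT: string;  // placeholder for a serverless Whisper endpoint

    async function transcribeChunk(chunk: Blob): Promise<string> {
      const res = await fetch(STT_ENDPOINT, { method: "POST", body: chunk });
      const { text } = await res.json();
      return text;
    }

    async function transcribeRecording(chunks: AsyncIterable<Blob>): Promise<string> {
      const parts: Promise<string>[] = [];
      // The first chunk goes out as soon as it is recorded, so the ~10s cold boot
      // overlaps with the user continuing to speak; later chunks hit a warm instance.
      for await (const chunk of chunks) {
        parts.push(transcribeChunk(chunk));
      }
      // Reassemble the transcript in chunk order once everything comes back.
      return (await Promise.all(parts)).join(" ");
    }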
Cloudflare has almost no cold boot, and I think their ML models are prefetched within the same DC, so loading the model the first time should have no noticeable overhead either.
Correct me if I'm wrong, but that's how I interpreted it when they first started with ML models on their CPUs.
Makes sense. A bring-your-own-model feature would be awesome, but it would be impossible to prefetch rarely used models without tying up valuable GPU RAM.
It is a startup and happened in the onboarding demo, I'll avoid naming them in case they sort it out or whatever. If you really want to know, please send me an email.
This is very cool. I'm still trying to understand the pricing. What is a "neuron" in this context? A token? A character?
"Neurons are a way to measure AI output that always scales down to zero (if you get no usage, you will be charged for 0 neurons). To give you a sense of what you can accomplish with a thousand neurons, you can: generate 130 LLM responses, 830 image classifications, or 1,250 embeddings."
130 LLM responses of what length?
1,250 embeddings of what size of text?
It's effectively a unit of time benchmarked to what we can accomplish in that time as of Sept 27, 2023 (launch). The challenge here is that because we're abstracting away the underlying hardware it's not the same as renting a VM for a period of time. We also don't want to create perverse incentives that keep us from making the underlying system faster. It's similar to how AWS standardized EC2 to a standard compute unit. Over time, as we continue to add faster and faster hardware and better optimize models we expect the cost of a neuron will trend down but the amount of AI inference work that you can do with a neuron will remain relatively constant.
Then call it something like Neural Time Unit (NTU) or Computational Time Unit (CTU) because neurons make people think of neural networks. As in, you pay for the size of your model.
Could you give us an example of what that means in practical terms?
The post says that 1000 neurons will give you 130 LLM responses - but of what length?
(LLMs are generally priced by input and output tokens. The more tokens, the longer the compute time. Without an idea of what you mean by a response, it's hard to understand.)
Likewise: 1,250 embeddings – How big is the text size in the example?
I'm VERY excited to see you doing this and understand it's early stages, but I can't wrap my head around the pricing without context.
Really amazing stuff to see this launch with Hugging Face! Hope to see it expand beyond text too.
“neuron” is a cute name, but there's too much conceptual overlap with floating point ops, layers, model parameters, etc., which are time-independent. Should just call them inference credits or something. When some large model runs on multiple GPUs, it's even more confusing what neurons / dollars per second might be.
That was our early alpha cooperation with NVIDIA. We've learned a lot since then. Not to mention, the AI ecosystem has grown up a bunch. But you are correct: this is something we've been planning for a loooooooong time.
It's 100% designed to let you try it out for free, so something else must be going on. Feel free to message me at pwittig at cloudflare dot com, and I'm happy to help debug.
I want to love Workers, but I've never had great luck with anything more than a basic CRUD app. Even getting an external DB to connect proved to be more work than it should have, and their docs are all outdated and all over the place, often contradicting themselves.
The docs are soooooo bad. This is the same story with any exciting / new product.
Honestly, I'm beginning to think that we need some kind of documentation-first style development. Like TDD, but DDD....
I have integrated Facebook APIs, Instagram (well, same thing), Google APIs, Stripe APIs, Mailchimp APIs, etc. etc.
And the only thing common among all of them? The documentation is _terrible_..., like, _terrible_.
I also run a few products online and I spend so, so much time trying to get the documentation right. It's incredibly boring and tedious, but I really feel that if you want to set yourself apart from the big players, make good documentation. It can't be that hard.
Good documentation is the key to PHP's success, along with a bunch of other things that get dismissed as "inferior technology".
The worst documentation is the jargon-filled, abstract, vibes-based kind where the authors basically typed it with one hand on the keyboard. It's like "OK, you're amazing. Now how do I resolve this error, and what are your command line flags?"
TL;DR: GPUs all over the Cloudflare global network; working closely with Microsoft, Meta, Hugging Face, Databricks, NVIDIA; new Cloudflare-native vector database; inference embedded in Cloudflare Workers; native support for WebGPU. Live demo: https://ai.cloudflare.com/
Do you actually run the inference in the Worker? Or is it like what Fermyon does, where they basically host the models for you and you get an SDK that is automatically connected to the function?
It's a little like how Cloudflare Workers runs. You don't know which CPU it runs on, all you know is it's a CPU close to your end user. Same goes for this. We are rolling out GPUs everywhere across the globe and so Workers AI will just use a nearby GPU. Probably in the same machine as your workers, or maybe the same data center, or whatever other smart routing decision we make. What we are not doing is running a massive GPU cluster somewhere. This is all distributed and that's the power of owning your own network.
I think the confusion is what is meant by "in the Worker." From a hardware perspective, the GPU may be in the same machine as the CPU that's powering the Worker. Or they may be across different machines in our network. We are not routing requests to some third party. And we will try to run the inference task as close as possible to who/whatever requested it. The whole idea of "serverless" is you shouldn't have to worry about what machine where runs whatever unless you're on the team building the scheduling and routing logic at Cloudflare.
I think his question is more about whether the Worker directly accesses the GPU and thus requires JS tooling to handle the GPU somehow (no), or whether it makes subrequests to a separate GPU service that isn't running the Worker runtime (yes).
Any chance you're looking for technical product folks to work on this? I actually worked on a very similar deployment internally at Livepeer (focus was on live video enhancements but also generalized edge compute)!
On top of our hosted and supported catalog of models, and the deploy-to-CF partnerships like the HF one, you will also be able to bring your own custom model at some point.
Very cool and also very simple as I’d expect from Cloudflare.
But I have a question - why not make inference as easy as the translation? Why do I have to run that in a worker rather than just as a simple API call? That would be much simpler.
Is there a technical reason or is it that people would want to have logic before making the call to llama?
You can do both! All of our models are available both via the Workers/Pages binding (which makes it really easy to host the rest of the logic) and via a REST API.
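For the API route, it's roughly a single HTTP call, something like this (a sketch; ACCOUNT_ID, API_TOKEN, and the model name are placeholders, check the docs for the exact shape):

    // Rough sketch of calling a model over the REST API, no Worker involved.
    declare const ACCOUNT_ID: string;  // your Cloudflare account ID
    declare const API_TOKEN: string;   // an API token with Workers AI access

    const url =
      `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/ai/run/@cf/meta/llama-2-7b-chat-int8`;

    const res = await fetch(url, {
      method: "POST",
      headers: { Authorization: `Bearer ${API_TOKEN}` },
      body: JSON.stringify({ prompt: "Tell me about Workers AI" }),
    });
    console.log(await res.json());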
Very curious if you can elaborate on Vectorize. More than edge GPU's, entering the Vector DB marketplace and a CF proprietary integration is interesting (and a bit scary) to build on.
- Will Vectorize ever get OSS'd?
- If you want to migrate in either direction from some other vector DB (Milvus, Weaviate, Qdrant, Pinecone, etc.), what should you expect in terms of level of effort and features?
- What inherent advantages (latency? features?) would you get exclusively from Vectorize?
This sounds really cool. I already love cloudflare because of how easy they make it to compete with bigger companies for indie devs like myself.
Their pricing for their products always seems so much more affordable than something like AWS or GCP. I'm using R2 for storage for a client project, and from my calculations we would have to pay almost 3 times more if we hosted the files with AWS on S3.
I really hope they keep adding all the latest open-source AI models to their platform. If their pricing will be as cheap as they state in this blog post, then I would rather use this service than install models on my computer. To get a good inference speed for open source models on my PC right now, I have to let usage spike to up to 100%...
This could become something useful in the future, but right now it appears to be toy models only. I'm assuming Cloudflare will add useful models eventually, but the cold start times are going to be horrible on those. I'm struggling to think of useful applications for this. Maybe one day.
Are OpenAI still limiting Whisper to 50 requests per minute (and gpt-3.5-turbo at 3/min)? If so then they don't need to undercut OpenAI but just provide unlimited requests.. it's near impossible to provide user-facing AI solutions to customers due to these limits, and it's a very bad user experience (and security risk) to force users to provide their own OpenAI API keys.
https://replicate.com/ is much better, with an average of 10 requests per second, but I would still very much prefer unlimited (where they can monitor high-volume users to see if it's legitimate), or a new pricing model where e.g. 0-10k rpm is at $0.001/sec, 10-100k rpm is at $0.010/sec, and 100k+ rpm is at $0.100/sec (the pricing would of course need to be fine-tuned, just a quick example).
> (Coming Soon) The Llama 2 13B and 70B parameter models by Meta will soon be available via Amazon Bedrock’s fully managed API for inference and fine-tuning.
Any chance you're looking for technical product folks to work on this? I actually led a very similar deployment internally at Livepeer (the focus was on live video enhancements but also generalized edge compute)!
You can use JavaScript or Wasm to interface with the AI binding, think of it as the SDK, but the inference task itself runs natively on top of an ML runtime, and the models are loaded onto GPUs.
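From the Worker's point of view it's just a binding call, roughly like this (simplified sketch; the model name is just an example, see the docs for the exact usage):

    // Minimal Worker using the AI binding; the JS only marshals the request,
    // the inference itself runs on a GPU outside the Workers runtime.
    import { Ai } from "@cloudflare/ai";

    export default {
      async fetch(request: Request, env: { AI: any }): Promise<Response> {
        const ai = new Ai(env.AI);
        const answer = await ai.run("@cf/meta/llama-2-7b-chat-int8", {
          prompt: "What is the origin of the phrase 'Hello, World'?",
        });
        return Response.json(answer);
      },
    };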
Embedding cost and model choice makes this a very compelling choice. I'm working on leveraging embeddings in https://github.com/discourse/discourse-ai where it powers offering related topics, semantic search, tag and category recommendations among other things.
A cheap offering like this can make it a lot more reasonable for self-hosters.
This would be a game changer if they had something for image generation as well. Oh well, maybe it's coming soon as the page says it's just a small preview. The best part for me personally is it's available on all plans and pricing looks good too.
OT but if cloudflare fixes their false positives on showing random captchas when I'm trying to browse the net on VPN they will surely be one of my favorite companies.
There are a lot of similarities, but here are a few differences:
- It's region-less, and runs your inference task on the Cloudflare network, near your end users. Though that's not entirely true yet - we'll be in 100 sites by EOY '23, and nearly everywhere by EOY '24.
- It was built to work alongside our new vector database, Vectorize, out of the box.
- It's accessible to all developers, regardless of where you deploy (via API), but we wanted to offer a seamless option for developers already building on Cloudflare - Workers, Pages, etc.
Unless I missed something, it seems they didn't share which models by parameter count they're gonna be offering, nor anything at all about cost.
Also, this strikes me as really weird phrasing:
> Llama 2 now available for global usage on Cloudflare’s serverless platform, providing privacy-first, local inference to all
"privacy-first & local inference" would be to run it locally, on your own hardware, isn't that exactly what should be referred by when using "local"? Or has the definitions of words completely gone out the window as of late?
I missed that blog post somehow, thanks for sharing that. Bit disappointing that it's just the 7B model, it's a good starting point for fine tuning another small model, but it really isn't useful on its own.
> This is just a press release, and not a great source of in-depth info
Not sure I'd call outright lying/getting the most basic points wrong "not a great source of in-depth info", like the "local-first" part.
Yeah, I think so too, but I guess I'm more upset that they use "local" in a "remote but physically nearby" fashion when it usually refers to "this computer/device".
I don't work at Facebook, so I'm not gonna spend my time doing their work for them or you.
But off the top of my head, classifying the data into various categories and mentions of Facebook/Meta could allow them to derive sentiment about the company based on geographical location, and know where to invest more in changing the sentiment.
Thus you will have to read the license and judge whether it is compatible with your business or use cases.
One should give them credit for making it available, which is a lot better than plenty of others. However, actually open models are starting to appear, so perhaps we will soon see Facebook and the likes making theirs open as well? Who knows.
How well-protected is the "edge" in edge computing? I can see Cloudflare has edge locations in many countries. Can entities with physical access to Cloudflare's edge machines get access to sensitive user data?
With Cloudflare's default settings, a malicious entity can intercept any Cloudflare <-> Backend connections invisibly to the end user since the SSL certificates aren't validated. The end user also can be victim to plain old HTTP MITM on Cloudflare's upstream networks, as happened in 2016: https://news.ycombinator.com/item?id=12091900
It's hard to take Cloudflare's commitment to security seriously when they still ship such terrible default settings.
You can install certificates issued by Cloudflare, and then the only thing that can connect to your server is Cloudflare.
No one can intercept it then.
If you're talking about Flexible SSL: sure, you can use it purely as an HTTPS proxy for the SEO score of your blog. But securing it is not much effort.
If it's just for a static blog, I'm not sure why you would bother, though.
If you have a valid HTTPS certificate for example.com and then add example.com to Cloudflare, your overall security decreases because the path from the CF datacenter to your origin is now vulnerable to MITM - the default SSL setting is "Full" which doesn't check certificate validity.
To the less experienced sysadmin everything looks like it's working fine and users also don't notice any difference, which is why it's a terrible default.
Sure, you _can_ configure Cloudflare securely, but it should be secure out of the box. That, however, adds friction when the origin doesn't have a valid SSL certificate, which probably hurts someone's KPIs.
CEO here, simply because we want to protect the core components that differentiate our services. Similar to, say https://www.photopea.com/ and many others we can find if we look behind the scenes.
Once we raise funding, and establish strong market presence, we will revisit this decision, and dedicate developer resources to sharing our in-house tech with the world more openly. This will take a full position. If you liked what you see, and want to work with us, send us an email at work@elefunc.com and we will get in touch once we have open positions!
Looks interesting, but as feedback, the above-the-fold message on your homepage is buzzword salad, not really clear what you're offering a potential customer here. "Elefunc is building the real-time web, to empower thought-speed creativity."
By way of comparison, I know exactly what I'm getting as a developer on Replit's homepage: "Make something great. Build software collaboratively with the power of AI, on any device, without spending a second on setup"
Thank you! We are just getting started, and you've offered our first valuable public feedback on the company landing page. I will make sure our above-the-fold becomes just as clear! If you have further suggestions, please send them to support@elefunc.com