
I've been using Cloud Run for my GPT-2 text generation apps (https://github.com/minimaxir/gpt-2-cloud-run) in order to survive random bursts, and also for small Twitter bots (https://github.com/minimaxir/twitter-cloud-run/tree/master/h...) which can be invoked via Cloud Scheduler to take advantage of the efficiency benefits. It has been successful in those tasks.

The only complaint I have with Cloud Run now (after many usability updates since the initial release) is that there is no IP rate limiting to prevent abuse, which has been the primary cause of unexpected costs. (Due to how Cloud Run works, IP rate limiting has to be on Google's end; implementing it on your end via a proxy eliminates the ease-of-use benefits.)
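
For reference, a minimal sketch (my own illustration, not from the linked repos) of what per-IP limiting inside the container would look like, which also shows why it doesn't really solve the billing problem: each instance keeps its own counters, and the request has already spun up a billable instance by the time it's rejected.

    # Hand-rolled per-IP rate limiting inside the container (illustrative only).
    import time
    from collections import defaultdict, deque

    from flask import Flask, abort, request

    app = Flask(__name__)

    WINDOW_SECONDS = 60
    MAX_REQUESTS_PER_WINDOW = 10
    _hits = defaultdict(deque)  # per-instance state; reset whenever the instance is recycled

    def client_ip() -> str:
        # Behind Cloud Run's load balancer the caller's IP arrives in X-Forwarded-For.
        forwarded = request.headers.get("X-Forwarded-For", "")
        return forwarded.split(",")[0].strip() or request.remote_addr

    @app.before_request
    def rate_limit():
        now = time.time()
        hits = _hits[client_ip()]
        while hits and now - hits[0] > WINDOW_SECONDS:
            hits.popleft()
        if len(hits) >= MAX_REQUESTS_PER_WINDOW:
            abort(429)
        hits.append(now)

    @app.route("/")
    def generate():
        return "generated text would go here\n"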



I'm currently serving an API that uses a 500 MB ResNet v2 model. The bootup takes too long, so now I have a single instance that can't handle any peaks and costs too much. Doesn't your model take too long to spin up before being able to serve a request?


From: https://github.com/ahmetb/cloud-run-faq#how-to-keep-a-cloud-...

---

How to keep a Cloud Run service “warm”?

You can work around "cold starts" by periodically making requests to your Cloud Run service which can help prevent the container instances from scaling to zero.

Use Google Cloud Scheduler to make requests every few minutes.
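
(Not part of the quoted FAQ, but as a sketch of that tip, assuming the google-cloud-scheduler Python client and placeholder project/region/URL values:)

    # Creates a Cloud Scheduler job that pings the service every 5 minutes so
    # the instance count doesn't drop to zero. All values below are placeholders.
    from google.cloud import scheduler_v1

    PROJECT_ID = "my-project"                                # placeholder
    REGION = "us-central1"                                   # placeholder
    SERVICE_URL = "https://my-service-abc123-uc.a.run.app/"  # placeholder

    client = scheduler_v1.CloudSchedulerClient()
    parent = client.common_location_path(PROJECT_ID, REGION)

    job = {
        "name": f"{parent}/jobs/keep-warm",
        "http_target": {
            "uri": SERVICE_URL,
            "http_method": scheduler_v1.HttpMethod.GET,
        },
        "schedule": "*/5 * * * *",  # every 5 minutes
        "time_zone": "Etc/UTC",
    }

    client.create_job(parent=parent, job=job)

The same job can also be created from the console or with gcloud scheduler jobs create http.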

Does my application get multiple requests concurrently?

Contrary to most serverless products, Cloud Run is able to send multiple requests to your container instances to be handled simultaneously.

Each container instance on Cloud Run is (currently) allowed to handle up to 80 concurrent requests. This is also the default value.

What if my application can’t handle concurrent requests?

If your application cannot handle that many concurrent requests, you can configure the concurrency value when deploying your service with gcloud or in the Cloud Console.

Most of the popular programming languages can process multiple requests at the same time thanks to multi-threading. But some languages may need additional components to serve concurrent requests (e.g. PHP with Apache, or Python with gunicorn).
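
(Again not part of the FAQ, but a sketch of the Python-with-gunicorn case, with illustrative worker/thread/concurrency values:)

    # A minimal Flask app; gunicorn threads let one container instance handle
    # several requests at once, which is what Cloud Run's concurrency setting
    # assumes. Values below are illustrative, not recommendations.
    #
    # Container start command (illustrative):
    #   gunicorn --bind :8080 --workers 1 --threads 8 main:app
    # Matching deploy-time setting (illustrative):
    #   gcloud run deploy my-service --concurrency 8 ...
    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def handle():
        # With --threads 8, up to 8 copies of this handler can run at the same
        # time in a single instance; keep Cloud Run's concurrency setting at or
        # below what the worker pool can actually serve.
        return "ok\n"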


Yes, unfortunately, but that's the caveat of services-on-demand. I'm looking into more efficient/cheaper model deployment workflows. (It might be just running the equivalent of Cloud Run on Knative/GKE, backed by GPUs.)


Does GCP have an equivalent of Aurora Serverless? If so, would choosing that over Cloud SQL have been cheaper?

If you're familiar with AWS, would using AWS Batch exclusively with Spot pricing [0] (or Fargate with Spot pricing) and Aurora Serverless [1] have been cheaper than Cloud Run + Cloud SQL?

[0] Say the service runs for 5 minutes every hour for 30 days. The respectable a1.large instance would cost $0.50 per month, while the cheapest t3.nano would cost around $0.19 per month.

[1] Say the service stores a rolling 10 GiB per month ($1.00) and makes about 1,000 calls during each 5-minute run every hour ($0.20), using 2 ACUs ($3.60). That would come to $4.80 per month.


TensorFlow Serving, or whatever name Google Cloud has given their managed version of it.


You can run it on GKE with autoscaling.


How much do you end up paying on average for a tweet-sized generated text?


I haven't created a service for auto-generated tweets yet (just human-curated ones), but for a similar service which outputs tweet-length text (with 2 GiB of RAM), it takes about 30s on a cold boot (which makes sense, as it has to load the model into RAM), and ~12s to generate text after a cold boot.

From the pricing (https://cloud.google.com/run/pricing):

12 * ($0.00002400) + 12 * (2 * $0.00000250) = $0.000348 per text

...and that's assuming you go over the free tier limit.
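
Spelled out (a quick sketch of the same arithmetic, assuming one vCPU and 2 GiB as the formula implies; the rates are the per-vCPU-second and per-GiB-second prices above, and the monthly figure is just an illustrative extrapolation at one text per hour):

    # Cloud Run cost per generated text: 12s of a 1 vCPU / 2 GiB instance,
    # using the rates from the calculation above.
    CPU_PER_VCPU_SECOND = 0.000024
    MEM_PER_GIB_SECOND = 0.0000025
    SECONDS = 12
    MEMORY_GIB = 2

    cost_per_text = SECONDS * CPU_PER_VCPU_SECOND + SECONDS * MEMORY_GIB * MEM_PER_GIB_SECOND
    print(f"${cost_per_text:.6f} per text")                       # $0.000348
    print(f"${cost_per_text * 24 * 30:.2f}/month at 1 text/hour")  # ~$0.25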



