When I was deciding on a queuing system for the product I'm writing, after seeing the price of running a Kafka cluster on DigitalOcean and elsewhere, I opted for polling in PostgreSQL, and I've been quite satisfied with it so far (about six months running). I still have some improvements to make to the queue listener (right now, all the workers poll until there's work, when I could have them back off to one), but otherwise it's plenty good enough.
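The backoff I'm describing is simple to sketch. This is a rough model, not my actual code — the function names are made up, and fetch_job() stands in for whatever SELECT ... FOR UPDATE SKIP LOCKED round-trip you use:

```python
import time

def next_sleep(current: float, found_work: bool,
               min_sleep: float = 0.5, max_sleep: float = 30.0) -> float:
    """Exponential backoff: reset to the floor when a job was found,
    otherwise double the wait, capped at max_sleep."""
    return min_sleep if found_work else min(current * 2, max_sleep)

def worker_loop(fetch_job, handle, should_stop):
    """fetch_job() returns a job or None; should_stop() lets tests/shutdown exit."""
    sleep = 0.5
    while not should_stop():
        job = fetch_job()
        if job is not None:
            handle(job)
        else:
            time.sleep(sleep)
        sleep = next_sleep(sleep, found_work=job is not None)
```

The point is just that an idle worker's query rate decays geometrically instead of hammering the database in a tight loop.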
I'm really surprised at the price of Kafka clusters, to be honest. Hundreds of dollars per month extra at minimum.
I tried using a PostgreSQL SKIP LOCKED queue once, turned out to be a mistake.
It was a queue for a distributed worker pool. The simpler alternatives I was used to at the time (RabbitMQ) did not support joining (i.e., run task Y when all of X1~X20 are complete), so every task was stored in the database anyway. I don't remember the exact numbers, but it was a light-to-moderate load — thousands, maybe tens of thousands of rows per day. It ran smoothly with an external message queue. I'd vacuum maybe every 4 months.
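For context, the join bookkeeping I mean is just a countdown per parent task. A toy in-memory model (names are mine; in Postgres this maps to something like UPDATE tasks SET pending = pending - 1 ... RETURNING pending, run inside the child's completing transaction):

```python
# Each parent task keeps a count of unfinished children; whichever child
# brings the count to zero enqueues the parent.
def complete_child(pending: dict, ready_queue: list, parent: str) -> None:
    pending[parent] -= 1
    if pending[parent] == 0:
        ready_queue.append(parent)  # all of X1~X20 done: Y becomes runnable
```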
For one iteration, I decided to try using PostgreSQL as the queue system as well, to decrease the number of moving parts. It performed fine for a bit, then slowed to a crawl. I was sure I was doing something wrong--except every guide told me this was how to use a table as a queue. If I missed anything, it must've been PG-as-a-queue-specific.
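For reference, the shape every one of those guides converges on looks roughly like this (table and column names here are illustrative, not from my schema). Every claim rewrites a row, which is exactly the churn that makes vacuum behavior so critical:

```python
# Canonical single-statement SKIP LOCKED dequeue for a hypothetical
# jobs(id, status, payload) table. The status flip generates a dead tuple
# per claim, so throughput depends on autovacuum keeping up.
DEQUEUE_SQL = """
UPDATE jobs
   SET status = 'running'
 WHERE id = (
       SELECT id
         FROM jobs
        WHERE status = 'queued'
        ORDER BY id
        LIMIT 1
          FOR UPDATE SKIP LOCKED
       )
RETURNING id, payload;
"""
```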
What would you do if you wanted to pick up jobs in close to real time? Listen/notify? An extension that adds long polling? Consuming the WAL? Giving up and using a different technology?
I'm working on something similar, and these are the options I've been mulling over. They each come with pretty significant drawbacks. My current plan is to use listen/notify because it's low-hanging fruit, and see how that pans out, but I was wondering if anyone has been down this path before and has wisdom they're willing to share.
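One shape this plan can take — and this is just a sketch of the pattern, class and method names are mine — is to treat NOTIFY purely as a wake-up hint while the poll remains the source of truth. A dropped notification then costs one poll interval of latency rather than a lost job:

```python
import threading

class WakeableQueuePoller:
    """Hybrid listen/notify + polling: notify() is called from the LISTEN
    callback thread; the worker blocks in wait_for_work() between polls."""

    def __init__(self, poll_interval: float = 10.0):
        self._wake = threading.Event()
        self.poll_interval = poll_interval

    def notify(self) -> None:
        self._wake.set()

    def wait_for_work(self) -> bool:
        """Block until notified or the poll interval elapses.
        True means woken by a notification, False means an ordinary poll."""
        woke = self._wake.wait(timeout=self.poll_interval)
        self._wake.clear()
        return woke
```

Either way the worker runs the same dequeue query afterwards, so losing the notification channel degrades latency, not correctness.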
As a generality, you either need something done Now, in which case don't put it in a queue, just do it immediately.
Or you need it done Not Right Now, in which case you stick it in a queue.
Or maybe you need something done Now, but on a different computer, in which case you need RPC of some sort. I would not recommend reinventing RPC using listen/notify.
Durability is more important to this use case than performance. I want the option of picking up a job within, say, a minute, but I am willing to accept a much worse worst case so long as I never lose track of a job.
Another scenario to consider: a job fails because of a transient error local to the worker (for instance, it may have exceeded an IP-based rate limit on an API it consumes). If we drew another worker, it would succeed without a hitch. I want to put the job back into the queue and have it picked up again quickly, but I can't just keep retrying on the worker I have.
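One way to express that requirement — purely a hypothetical policy, not anything I've shipped — is to requeue the job immediately but give the failing worker a personal cooldown, so any other worker can claim it right away:

```python
# Worker-local transient failure: job goes straight back to 'queued', but the
# failing worker records a per-worker "not before" timestamp for this job.
def requeue(job: dict, failed_worker: str, now: float, cooldown: float = 60.0) -> dict:
    updated = dict(job)
    updated["status"] = "queued"
    blocked = dict(updated.get("not_before", {}))
    blocked[failed_worker] = now + cooldown
    updated["not_before"] = blocked
    return updated

def eligible(job: dict, worker: str, now: float) -> bool:
    """Can this worker claim the job right now?"""
    return job["status"] == "queued" and now >= job.get("not_before", {}).get(worker, 0.0)
```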
Autoscaling workers that poll in a tight loop seems like a way to bring down production to me. The database gets too many queries, causing the job queue to grow, causing more workers to spin up, causing an even higher query load, ....
Here's the bug: if there are no tasks pending at the moment, you sleep, for example for 10 seconds. The code in the presentation handled this correctly.
Then the database load from transactions that didn't find a job won't be noticeable until there are thousands of workers. And when (and if!) there are — I don't have experience with that kind of scale, but I suspect all the bottlenecks will still be in the other branch: transactions that actually modify the database.
If you want close-to-real-time performance, why go for a task table at all? You could spawn threads and track them instead, provided you have the resources. And if you don't have the resources, I don't see why the minor listen/notify overhead would bother you, since tasks will be waiting some of the time regardless.
I explain in another comment that I want jobs to be durable, so I can't just spin up a new thread.
It's not the performance overhead that bothers me so much as the session-sticky state and the additional complexity in the database. My instinct is that it will cause outages in ways that take significant Postgres chops to diagnose and correct. That's just an instinct, though; it could be far off base.
Another Postgres queue. I'm missing a variant that allows multiple consumers to process one payload, and one where you can parallelize based on a certain key (as Kafka does).
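The keyed-parallelism part, at least, can be layered on top of a plain job table without queue support. A sketch (partition count and function names are mine): hash each job's key into a fixed number of partitions, and have each worker dequeue only from the partitions it owns, so jobs sharing a key are never processed concurrently:

```python
import zlib

NUM_PARTITIONS = 16  # fixed, Kafka-style; stored as a column on each job row

def partition_for(key: str) -> int:
    # crc32 rather than built-in hash(): stable across processes and restarts
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

def partitions_for_worker(worker_index: int, num_workers: int) -> list[int]:
    """Static assignment: worker i owns every partition p with p % num_workers == i."""
    return [p for p in range(NUM_PARTITIONS) if p % num_workers == worker_index]
```

The dequeue query then adds a `WHERE partition IN (...)` clause per worker; the missing piece, as with Kafka consumer groups, is rebalancing when workers come and go.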