When I was deciding on a queuing system for the product I'm writing, after seeing the price of running a Kafka cluster on DigitalOcean and elsewhere, I opted for polling in PostgreSQL, and I've been quite satisfied with it so far (about six months running). I still have some improvements to make to the queue listener (right now, all the workers poll until there's work, when I could have them back off to one), but otherwise it's plenty good enough.
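The backoff I'm describing is simple to sketch. This is a rough model, not my actual code — the function names are made up, and fetch_job() stands in for whatever SELECT ... FOR UPDATE SKIP LOCKED round-trip you use:

```python
import time

def next_sleep(current: float, found_work: bool,
               min_sleep: float = 0.5, max_sleep: float = 30.0) -> float:
    """Exponential backoff: reset to the floor when a job was found,
    otherwise double the wait, capped at max_sleep."""
    return min_sleep if found_work else min(current * 2, max_sleep)

def worker_loop(fetch_job, handle, should_stop):
    """fetch_job() returns a job or None; should_stop() lets tests/shutdown exit."""
    sleep = 0.5
    while not should_stop():
        job = fetch_job()
        if job is not None:
            handle(job)
        else:
            time.sleep(sleep)
        sleep = next_sleep(sleep, found_work=job is not None)
```

The point is just that an idle worker's query rate decays geometrically instead of hammering the database in a tight loop.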
I'm really surprised at the price of Kafka clusters, to be honest. Hundreds of dollars per month extra at minimum.
I tried using a PostgreSQL SKIP LOCKED queue once, turned out to be a mistake.
It was a queue for a distributed worker pool. The simpler alternatives I was used to at the time (RabbitMQ) did not support joining (i.e., run task Y when all of X1~X20 are complete), so every task was stored in the database anyway. I don't remember the exact numbers, but it was a light-to-moderate load — thousands, maybe tens of thousands of rows per day. It ran smoothly with an external message queue. I'd vacuum maybe every 4 months.
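For context, the join bookkeeping I mean is just a countdown per parent task. A toy in-memory model (names are mine; in Postgres this maps to something like UPDATE tasks SET pending = pending - 1 ... RETURNING pending, run inside the child's completing transaction):

```python
# Each parent task keeps a count of unfinished children; whichever child
# brings the count to zero enqueues the parent.
def complete_child(pending: dict, ready_queue: list, parent: str) -> None:
    pending[parent] -= 1
    if pending[parent] == 0:
        ready_queue.append(parent)  # all of X1~X20 done: Y becomes runnable
```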
For one iteration, I decided to try using PostgreSQL as the queue system as well, to decrease the number of moving parts. It performed fine for a bit, then slowed to a crawl. I was sure I was doing something wrong--except every guide told me this was how to use a table as a queue. If I missed anything, it must've been PG-as-a-queue-specific.
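For reference, the shape every one of those guides converges on looks roughly like this (table and column names here are illustrative, not from my schema). Every claim rewrites a row, which is exactly the churn that makes vacuum behavior so critical:

```python
# Canonical single-statement SKIP LOCKED dequeue for a hypothetical
# jobs(id, status, payload) table. The status flip generates a dead tuple
# per claim, so throughput depends on autovacuum keeping up.
DEQUEUE_SQL = """
UPDATE jobs
   SET status = 'running'
 WHERE id = (
       SELECT id
         FROM jobs
        WHERE status = 'queued'
        ORDER BY id
        LIMIT 1
          FOR UPDATE SKIP LOCKED
       )
RETURNING id, payload;
"""
```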
What would you do if you wanted to pick up jobs in close to real time? Listen/notify? An extension that adds long polling? Consuming the WAL? Giving up and using a different technology?
I'm working on something similar, and these are the options I've been mulling over. They each come with pretty significant drawbacks. My current plan is to use listen/notify because it's low-hanging fruit, and see how that pans out, but I was wondering if anyone has been down this path before and has wisdom they're willing to share.
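One shape this plan can take — and this is just a sketch of the pattern, class and method names are mine — is to treat NOTIFY purely as a wake-up hint while the poll remains the source of truth. A dropped notification then costs one poll interval of latency rather than a lost job:

```python
import threading

class WakeableQueuePoller:
    """Hybrid listen/notify + polling: notify() is called from the LISTEN
    callback thread; the worker blocks in wait_for_work() between polls."""

    def __init__(self, poll_interval: float = 10.0):
        self._wake = threading.Event()
        self.poll_interval = poll_interval

    def notify(self) -> None:
        self._wake.set()

    def wait_for_work(self) -> bool:
        """Block until notified or the poll interval elapses.
        True means woken by a notification, False means an ordinary poll."""
        woke = self._wake.wait(timeout=self.poll_interval)
        self._wake.clear()
        return woke
```

Either way the worker runs the same dequeue query afterwards, so losing the notification channel degrades latency, not correctness.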
As a generality, you either need something done Now, in which case don't put it in a queue, just do it immediately.
Or you need it done Not Right Now, in which case you stick it in a queue.
Or maybe you need something done Now, but on a different computer, in which case you need RPC of some sort. I would not recommend reinventing RPC using listen/notify.
Durability is more important to this use case than performance. I want the option of picking up a job within, say, a minute, but I am willing to accept a much worse worst case so long as I never lose track of a job.
Another scenario to consider: a job fails because of a transient error local to the worker (for instance, it may have exceeded an IP-based rate limit on an API it consumes). If we drew another worker, it would succeed without a hitch. I want to put the job back into the queue and have it picked up again quickly, but I can't just keep retrying on the worker I have.
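One way to express that requirement — purely a hypothetical policy, not anything I've shipped — is to requeue the job immediately but give the failing worker a personal cooldown, so any other worker can claim it right away:

```python
# Worker-local transient failure: job goes straight back to 'queued', but the
# failing worker records a per-worker "not before" timestamp for this job.
def requeue(job: dict, failed_worker: str, now: float, cooldown: float = 60.0) -> dict:
    updated = dict(job)
    updated["status"] = "queued"
    blocked = dict(updated.get("not_before", {}))
    blocked[failed_worker] = now + cooldown
    updated["not_before"] = blocked
    return updated

def eligible(job: dict, worker: str, now: float) -> bool:
    """Can this worker claim the job right now?"""
    return job["status"] == "queued" and now >= job.get("not_before", {}).get(worker, 0.0)
```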
Autoscaling workers that poll in a tight loop seems like a way to bring down production to me. The database gets too many queries, causing the job queue to grow, causing more workers to spin up, causing an even higher query load, ....
Here's the bug: if there are no tasks pending at the moment, you sleep, for example for 10 seconds. The code in the presentation handled this correctly.
Then the database load from transactions that didn't find a job won't be noticeable until there are thousands of workers. And when (and if!) there are — I don't have experience with that kind of scale, but I suspect all the bottlenecks will still be in the other branch: transactions that actually modify the database.
If you want close-to-real-time performance, why go for a task table at all? You could spawn threads and track them instead, provided you have the resources. And if you don't have the resources, I don't see why the minor listen/notify overhead would bother you, since tasks will be waiting some of the time regardless.
I explain in another comment that I want jobs to be durable, so I can't just spin up a new thread.
It's not the performance overhead that bothers me so much as the session-sticky state and the additional complexity in the database. My instinct is that it will cause outages in ways that take significant Postgres chops to diagnose and correct. That's just an instinct, though; it could be far off base.
Another Postgres queue. I'm missing a variant that allows multiple consumers to process one payload, and one where you can parallelize based on a certain key (as Kafka does).
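The keyed-parallelism part, at least, can be layered on top of a plain job table without queue support. A sketch (partition count and function names are mine): hash each job's key into a fixed number of partitions, and have each worker dequeue only from the partitions it owns, so jobs sharing a key are never processed concurrently:

```python
import zlib

NUM_PARTITIONS = 16  # fixed, Kafka-style; stored as a column on each job row

def partition_for(key: str) -> int:
    # crc32 rather than built-in hash(): stable across processes and restarts
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

def partitions_for_worker(worker_index: int, num_workers: int) -> list[int]:
    """Static assignment: worker i owns every partition p with p % num_workers == i."""
    return [p for p in range(NUM_PARTITIONS) if p % num_workers == worker_index]
```

The dequeue query then adds a `WHERE partition IN (...)` clause per worker; the missing piece, as with Kafka consumer groups, is rebalancing when workers come and go.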