
> I do not understand why people aren't clamouring against postgres's connection model.

There are people wanting to change that - including me, the author of the blog post. I explained in an earlier blog post ([1]) why I chose to work on making snapshots more scalable at this time:

> Lastly, there is the aspect of wanting to handle many tens of thousands of connections, likely by entirely switching the connection model. As outlined, that is a huge project / fundamental paradigm shift. That doesn’t mean it should not be tackled, obviously.

> Addressing the snapshot scalability issue first thus seems worthwhile, promising significant benefits on its own.

> But there’s also a more fundamental reason for tackling snapshot scalability first: While switching the connection model would, e.g., address some memory usage issues at the same time, it would not at all address the snapshot issue. We would obviously still need to provide isolation between the connections, even if a connection wouldn’t have a dedicated process anymore.

> However, this means that even if your server can theoretically handle 5000 concurrent reads, you will never get there, because opening 5000 connections isn't practical. You'll likely hit a memory limit.

It's quite possible to have 5000 connections, even leaving poolers aside. When using huge_pages=on, a connection has an overhead of < 2MiB ([2]). Obviously the resulting ~10GiB for 5000 connections isn't peanuts, but it's also not a crazy amount.
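To make that concrete, a minimal postgresql.conf sketch for such a setup might look like the following; the values are illustrative, not tuned recommendations:

    # postgresql.conf -- illustrative values, not tuned recommendations
    huge_pages = on           # back shared memory with huge pages, reducing
                              # per-connection page table overhead (see [2])
    max_connections = 5000    # ~2MiB of overhead each with huge_pages=on
    shared_buffers = 8GB      # shared across all connections, sized separately

Note that huge_pages=on requires the OS to have huge pages configured (e.g. vm.nr_hugepages on Linux); otherwise the server will refuse to start.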

> Why is this generally accepted as ok?

Postgres is an open source project. It's useful in a lot of cases. It's not in some others - partially due to non-fundamental limitations. There's a fairly limited set of developers - we can only work on so many things at a time...

> Decoupling concurrency from connections seems possible via pipelining commands over a single connection.

Could you expand on what you mean here?

> Is it just too late for a project as mature as Postgres?

No. It's entirely doable to decouple processes and connections. It however definitely is a large project, with some non-trivial prerequisites.

[1] https://techcommunity.microsoft.com/t5/azure-database-for-po...

[2] https://blog.anarazel.de/2020/10/07/measuring-the-memory-ove...

EDIT: formatting woes



There is a much better reference somewhere (possibly from Ingres times, or later), but here is Stonebraker in 1986 describing how POSTGRES ended up with a process-per-connection model:

> DBMS code must run as a separate process from the application programs that access the database in order to provide data protection. The process structure can use one DBMS process per application program (i.e., a process-per-user model [STON81]) or one DBMS process for all application programs (i.e., a server model). The server model has many performance benefits (e.g., sharing of open file descriptors and buffers and optimized task switching and message sending overhead) in a large machine environment in which high performance is critical. However, this approach requires that a fairly complete special-purpose operating system be built. In contrast, the process-per-user model is simpler to implement but will not perform as well on most conventional operating systems. We decided after much soul searching to implement POSTGRES using a process-per-user model architecture because of our limited programming resources. POSTGRES is an ambitious undertaking and we believe the additional complexity introduced by the server architecture was not worth the additional risk of not getting the system running. Our current plan then is to implement POSTGRES as a process-per-user model on Unix 4.3 BSD.

(THE DESIGN OF POSTGRES, https://dsf.berkeley.edu/papers/ERL-M85-95.pdf )

There is another reference, directly related to Postgres or PostgreSQL, that made it even clearer; I expect it came somewhat later. In effect it indicated that someone involved in the project had strong intentions of adding threading "real soon now". I'll update this comment if I figure out where that's from.

Threads were still a research topic in the mid-80s, so their absence from such an old design is easy to understand. Pthreads wasn't standardized until 1996, although several Unices (e.g. SunOS) had popular threading implementations well before then.


You might be thinking of

> In POSTGRES they are run as subprocesses managed by the POSTMASTER. A last aspect of our design concerns the operating system process structure. Currently, POSTGRES runs as one process for each active user. This was done as an expedient to get a system operational as quickly as possible. We plan on converting POSTGRES to use lightweight processes available in the operating systems we are using. These include PRESTO for the Sequent Symmetry and threads in Version 4 of Sun/OS.

From: The Implementation of POSTGRES - Michael Stonebraker, Lawrence A. Rowe and Michael Hirohama

Hat tip to Thomas Munro. I think he pointed this quote out to me in the past.


Thanks for doing this work! Postgres is amazing, and this area is the one place where e.g. MySQL has a clear lead.

> Postgres is an open source project. It's useful in a lot of cases. It's not in some others - partially due to non-fundamental limitations.

For me at least, postgres has the problem that it's too useful. It has so many clear advantages over the alternatives that it often makes sense to choose postgres over more specialized tools.

For example, even if I were primarily storing JSON data, postgres would still be a good choice, because it offers better consistency guarantees than most document databases and has a more powerful query model.
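As a rough illustration of that query model, here's a sketch using psycopg2 (the table and column names are made up for the example):

    import psycopg2
    from psycopg2.extras import Json  # adapts Python dicts to JSON literals

    conn = psycopg2.connect("dbname=app")  # hypothetical database
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, body jsonb)")
    cur.execute("CREATE INDEX IF NOT EXISTS docs_body_idx ON docs USING gin (body)")
    # JSONB containment (@>) and path extraction (->>) run inside ordinary
    # transactional SQL, so joins, indexes and MVCC all still apply.
    cur.execute("SELECT body->>'name' FROM docs WHERE body @> %s",
                (Json({"active": True}),))
    print(cur.fetchall())
    conn.commit()

The GIN index makes the containment query efficient, which is exactly the kind of thing many document stores make harder to combine with transactional guarantees.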

The end result is that the few remaining limitations (e.g. the connection model) are felt all the more strongly: I can't simply say "ah, I need lots of connections, I'll use MySQL here" without also giving up all sorts of other features I would normally use. (Although MySQL is still improving.)


I wasn't criticizing Postgres so much as expressing confusion about why I don't hear more people talking about this issue.

I think poolers confuse the issue: they solve the problem of allowing multiple processes to share the same pool of connections, but they don't increase throughput to the server.

The 5000 connection example was probably missing some context. I was actually assuming 10MB of overhead per connection, which makes for a decently expensive server without much throughput to show for it.

By pipelining I mean allowing multiple requests to be in flight on the same connection, each tagged with an ID and queued on the server. When a worker process gets around to processing a request, it sends back a response carrying the same ID. This increases driver complexity, but it means you need very few connections per client.
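A toy sketch of that scheme in Python (this is not the Postgres wire protocol; the newline-delimited JSON framing and all names here are invented for illustration):

    import asyncio, itertools, json

    class PipelinedClient:
        # Many in-flight queries on one connection, matched up by request ID.
        def __init__(self, reader, writer):
            self.reader, self.writer = reader, writer
            self.next_id = itertools.count().__next__
            self.pending = {}  # request ID -> Future awaiting the reply
            self.reader_task = asyncio.ensure_future(self._read_loop())

        async def query(self, sql):
            req_id = self.next_id()
            fut = asyncio.get_running_loop().create_future()
            self.pending[req_id] = fut
            # Tag the request with its ID; don't wait for earlier replies.
            self.writer.write(json.dumps({"id": req_id, "sql": sql}).encode() + b"\n")
            await self.writer.drain()
            return await fut  # resolved whenever the server answers this ID

        async def _read_loop(self):
            # Replies may arrive in any order; the ID routes each one home.
            while line := await self.reader.readline():
                msg = json.loads(line)
                self.pending.pop(msg["id"]).set_result(msg["rows"])

    async def main():
        reader, writer = await asyncio.open_connection("db.example", 5433)
        client = PipelinedClient(reader, writer)
        # Three queries in flight on a single connection:
        print(await asyncio.gather(client.query("SELECT 1"),
                                   client.query("SELECT 2"),
                                   client.query("SELECT 3")))

    # asyncio.run(main())  # needs a server speaking this toy framing

For what it's worth, the Postgres wire protocol already lets a client send queries without waiting for earlier results, but replies come back strictly in order; the tagged scheme above additionally allows out-of-order completion.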


> Could you expand on what you mean here?

Not sure what OP was thinking about, but for instance SQL Server supports something called Multiple Active Result Sets (MARS) on a single connection.

https://docs.microsoft.com/en-us/sql/relational-databases/na...
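As a quick sketch of what that looks like from a client (via pyodbc; the server, database and table names are placeholders):

    import pyodbc  # connection details below are placeholders

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=db.example;DATABASE=app;Trusted_Connection=yes;"
        "MARS_Connection=yes"  # enables Multiple Active Result Sets
    )
    # With MARS enabled, a second statement can execute while the first
    # result set is still open on the same connection.
    c1 = conn.cursor().execute("SELECT id FROM orders")
    first_row = c1.fetchone()
    c2 = conn.cursor().execute("SELECT COUNT(*) FROM customers")
    print(first_row, c2.fetchone())

Note that MARS interleaves result sets on one connection; it isn't the same as concurrent server-side execution.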



