> What an unimaginable horror! You can't change a single line of code in the product without breaking 1000s of existing tests. Generations of programmers have worked on that code under difficult deadlines and filled the code with all kinds of crap.
> Very complex pieces of logic, memory management, context switching, etc. are all held together with thousands of flags. The whole code is riddled with mysterious macros that one cannot decipher without picking up a notebook and expanding relevant parts of the macros by hand. It can take a day or two to really understand what a macro does.
> Sometimes one needs to understand the values and the effects of 20 different flags to predict how the code would behave in different situations. Sometimes 100s too! I am not exaggerating.
> The only reason why this product is still surviving and still works is due to literally millions of tests!
Tests are great, but relying on them in this way is like relying on a net to catch you without wearing a harness. It's a good thing if your last line of defense is reliable enough to catch you. But if you're relying on it, it's not a last line of defense, it's the only one.
You should be able to work on software because you understand how it works and what the ramifications of a given change are. Tests and code reviews provide redundancy. But here, they aren't providing redundancy, they're bearing the load.
What provides redundancy if tests are missing, broken, or misinterpreted? Have you ever fixed a bug, gone to write a test for it - and found that the test already existed but was passing spuriously?
In that sort of codebase, the only thing scarier than changing a line of code and breaking thousands of tests, is changing a line of code and not breaking any tests.
I think it is not RDBMSs specifically; rather, the combinatorial explosion of configurations/flags/options/platforms is what's insidious, and we, software engineering as a field, don't know how to handle it well.
If it can be improved, I think it is one of the highest-impact problems in software engineering. Maybe there is a way to restrict flag interactions and, as a result, shrink the support and test matrix.
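To make the scale of that test matrix concrete, here is a minimal, purely illustrative C harness (NUM_FLAGS and run_scenario are made-up names, not from any real project): with just 20 independent boolean flags, exhaustively covering the configuration space already means roughly a million runs of every scenario, and each additional flag doubles it.

    /* Illustrative only: brute-forcing every combination of N boolean
     * flags to show why the full configuration matrix is untestable. */
    #include <stdio.h>

    #define NUM_FLAGS 20                    /* 2^20 ~= 1 million configs */

    static void run_scenario(unsigned long config)
    {
        /* A real harness would set each flag from the bits of `config`
         * and exercise the code path under test. */
        (void)config;
    }

    int main(void)
    {
        unsigned long total = 1UL << NUM_FLAGS;
        for (unsigned long config = 0; config < total; config++)
            run_scenario(config);
        printf("ran %lu flag combinations\n", total);
        return 0;
    }

Techniques like pairwise/combinatorial test selection exist precisely to avoid that full sweep, but they assume you already know which flags can interact.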
Something I enjoyed was listening to the talks by the LibreSSL team cleaning up the mess in OpenSSL that (in part) caused the Heartbleed bug.
One of their strategies was to drop the macro soup and simply program against the "libc we would like to have", and then add compatibility shims to materialise their ideal libc instead of conditional compilation at the point of use.
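As a rough sketch of that idea (the HAVE_EXPLICIT_BZERO macro is an assumed autoconf-style configure result here, not necessarily the exact name any project uses): the rest of the code just calls explicit_bzero() unconditionally, and a small compat file supplies it once on platforms whose libc lacks it, instead of an #ifdef at every call site.

    /* compat/explicit_bzero.c -- illustrative portability shim; the real
     * LibreSSL-portable shims differ in detail. */
    #include <stddef.h>
    #include <string.h>

    #ifndef HAVE_EXPLICIT_BZERO
    /* Provide the interface the rest of the tree assumes exists. */
    void
    explicit_bzero(void *buf, size_t len)
    {
        /* Calling memset through a volatile function pointer discourages
         * the compiler from optimizing away a "dead" store. */
        void *(*volatile memset_fn)(void *, int, size_t) = memset;
        memset_fn(buf, 0, len);
    }
    #endif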
I suspect that an even bigger cause of the brittleness described in TFA is that an RDBMS inherently has to deal with concurrency. And not in the way that most applications do - the RDBMS is where other applications push their hairy concurrency problems.
The most enlightening part of the article was in the comments, where it was observed that, given the length of time Postgres has been going and the number of talented developers who have worked on the project, there are no easy enhancements or fixes left, just hard, knotty problems (paraphrasing).
It's almost a mark of success of the project. There is obviously a lot of dedication too.
- Postgres documentation is among the best-maintained database documentation out there. This also means that developers and committers ensure documentation changes accompany every relevant patch.
- Compare bugs in Postgres with MySQL, Oracle, or other databases: bugs are comparatively fewer and generally rare, even if you are supporting Postgres services as a vendor with lots of customers. The reason is the effort a strong team of developers puts into not accepting anything and everything; there are strict best practices, reviews, discussions, tests, and a lot more that make it difficult for a patch to make it into a release.
- Ultimately, the easier the acceptance of a patch, the greater the number of bugs.
I love Postgres the way it is today; it is still the DBMS of the year and developers' most-loved database.
I wish we had more contributors, committers, and developers, and also more users and companies supporting Postgres, so that pushing a feature gets faster and reasonably easier with more support.
Coming at this from a naive outsider perspective, the central problem described in the post (commits to PostgreSQL frequently have serious defects which must be addressed in follow-up commits) seems like one that would ideally be addressed with automated testing and CI tooling. What kind of testing does the Postgres project have? Are there tests which must pass before a commit can be integrated in the main branch? Are there tests that are only run nightly? Is most core functionality covered by quick-running unit tests, or are there significant pieces which can only be tested by hours-long integration tests? How expensive is it, in machine-hours, to run the full test suite, and how often is this done? What kinds of requirements are in place for including new tests with new code?
I would also note that the fixes started landing the day after the initial commit, and the other issues noted had fixes within three weeks. Of course PostgreSQL has testing, but with universal distribution and use cases that exercise the scheduler, network, filesystem, and I/O drivers (as with the Linux kernel and others), some things need wider audiences or more extreme testing scenarios (SQLite covers a strict subset of those considerations), and project health is measured by responding to that in a timely fashion. AFAICS this is all about trunk/main, versus releases, as well. So while the post (from a long-time PG contributor) labels it as hard, and yeah, I might agree (being a maintainer on other software, all of this resonates heavily), I'd also say it's an example of things done right.
Seems like a reason to celebrate the open source model, and specifically, here, how to do things better. Not to detract from the universal issues any project has with maintainer availability. But imagine a non-OSS database vendor with that degree of transparency or velocity; I can't think of any that are doing anything close, unless they got popped on a remote CVE, i.e., prioritized above features or politics in a corporate dev sprint. All software has bugs; it's about how fast things are fixed, and in the context of OSS, IMHO, fostering evolution among a diverse set of maintainers and use cases seems to be the better way.
As another example of that, it was a PostgreSQL hacker at MS who prevented the xz backdoor from going wide, because he cared enough about a perf regression to do the analysis.
Most database companies run only a small number of tests before committing. After committing, you run tests for thousands of hours. It sucks. You probably do this all day, every day. You just run the tests on whatever you have currently committed. You kind of have to be careful about not adding more tests that make it take much, much longer.
See https://news.ycombinator.com/item?id=18442941
Ahh, thanks, that piece of information suddenly makes TFA make sense. I was wondering how those issues could not have been caught by unit tests before committing/merging, but were seemingly caught soon afterwards, in a way that they could still immediately be ascribed to a specific commit.
What's missing in this post is a deep analysis of what the bugs are and what was causing them, in a five-whys sense. Especially if they all seem like dumb stuff at first.
There are some deep lessons about programming in this Factorio Friday Facts:
I'm pretty sure the answer is usually "concurrency". The examples alluded to in TFA sound like it. Handling concurrency is notoriously hard, and an RDBMS is what other applications use to solve their hairy concurrency problems so they don't have to.
I remember, waaaaay back in the early days of PostgreSQL, I was using it for a project and it crashed in a way that corrupted our data. There was no hardware problem; it was just a database crash. (This was quite some time ago. I don't recall what year it was, but I'm 68 now and have been a programmer since my senior year in college.) I switched to MySQL for that project. I assume PostgreSQL is not remotely prone to anything like that anymore!
It can still happen sometimes, but it's very rare that it crashes, and even rarer that it corrupts data as well. The good news is that the chances of it corrupting your data silently were basically zero, then and now.
However, back then MySQL seemed like it went out of its way to corrupt your data. The only "bonus" is that it did it all silently, so nobody ever noticed until they went looking. With MariaDB (the successor) it's pretty rare that it silently corrupts your data these days.
I believe a lot of that can be tied back to the default switching from MyISAM to InnoDB. Crashes frequently resulted in corrupt tables with MyISAM. InnoDB is very good at recovering.
Around the early 00s, it was already common wisdom among web devs that, when it comes to free RDBMSs, if you want speed, you use MySQL; and if you want consistency and reliability, then Postgres is where it's at.
I was interested in learning how to create a Postgres extension the other week. Not just bundling some SQL scripts, but a proper extension tying into their API.
Trying to find any good information on how to go about this proved super difficult. Well, I wasn't having much luck and just gave up.
I've created a few extensions in C. I found the "easy" path was looking at other extensions. The docs helped but bootstrapping my internals knowledge took a lot of time.
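For what it's worth, the C core of an extension is quite small once you see the pattern. The function below follows the version-1 calling convention from the "C-Language Functions" chapter of the Postgres docs (add_one is essentially the docs' classic example); the remaining pieces are a .control file, a SQL script declaring the function, and a PGXS makefile.

    /* add_one.c -- minimal C-language Postgres function using the
     * version-1 (fmgr) calling convention. */
    #include "postgres.h"
    #include "fmgr.h"

    PG_MODULE_MAGIC;                 /* required once per loadable module */

    PG_FUNCTION_INFO_V1(add_one);    /* registers the version-1 wrapper */

    Datum
    add_one(PG_FUNCTION_ARGS)
    {
        int32 arg = PG_GETARG_INT32(0);   /* first SQL argument, as int4 */

        PG_RETURN_INT32(arg + 1);
    }

On the SQL side it gets declared with something like CREATE FUNCTION add_one(integer) RETURNS integer AS 'MODULE_PATHNAME', 'add_one' LANGUAGE C STRICT, and a few-line PGXS makefile handles the build.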
There are some striking similarities to working on another large OSS codebase: Mozilla. (I am employed there.) We have struggled with all of these things for years, and we have a much larger pool of committers (and thus much higher variance in committer abilities). Things today are much better than they used to be, even if all of the same problems are still present to some degree.
Some of what we did might translate well to PostgreSQL, some of it won't, and much of it is probably too expensive and/or too much work. (Then again, it's work that doesn't require an inflight rocket surgeon to accomplish, which means it's doable by a much larger population of developers.)
- We've long had volunteer (and later, employee) "sheriffs" that monitor CI, know how to back things out, and over time get better at recognizing the sorts of problems that come up.
- For slow or expensive tests that don't run on every commit, they'll also take care of "backfilling" test jobs to narrow down which patch or patch stack most likely caused a problem.
- As with most CI systems, there's a staging area that gets a decent level of testing before changes are merged into the main development line.
- Feature gates for larger changes, so things can land in the mainline and be worked on there for a while, with CI regression tests running both with the feature enabled and disabled (as well as feature-specific tests when it's enabled). Good for reducing bit rot.
- Extensive fuzz testing (a minimal harness sketch follows this list). This would probably need to be specialized to a DB environment, since they're obviously very stateful. Various forms of snapshotting are good. For the browser (and especially the JavaScript engine I work on), it's hard to overstate just how useful this is. I would guess it could work quite well for a DB engine too.
- Lots of resources poured into test machines. With enough machines, good sheriffs, and a rich test suite, test latency doesn't matter all that much. You may not know about the problems for a day or three, but if you can depend on either getting backed out or your feature re-disabled, then you can fire and forget with no guilt. (Ok, the sheriffs will start getting snippy if you bounce a landing too many times, as is their prerogative.)
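To make the fuzzing bullet above concrete: a libFuzzer-style target is just one function that receives a byte buffer, and the engine mutates inputs to find crashes. Everything here besides the LLVMFuzzerTestOneInput entry point is a made-up stand-in (parse_query especially); a real DB fuzzer would feed the bytes into the SQL parser, WAL replay, protocol decoding, and so on.

    /* Sketch of a libFuzzer-style target; parse_query() is hypothetical. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    /* Trivial stand-in so the sketch links; a real target would call
     * into the engine under test instead. */
    static int parse_query(const char *sql)
    {
        return sql[0] == 'S';
    }

    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        char *sql = malloc(size + 1);
        if (sql == NULL)
            return 0;
        memcpy(sql, data, size);
        sql[size] = '\0';       /* treat the fuzz input as a query string */
        parse_query(sql);       /* crashes and UB are caught by sanitizers */
        free(sql);
        return 0;
    }

Built with clang -fsanitize=fuzzer,address, the fuzzing engine supplies main() and drives the corpus mutation for you.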
I'm guessing DB development and testing has tons of idiosyncratic difficulties, but it all sounds so familiar that I think many of the same approaches could work. The inevitable "turning the buildfarm red" should not lead to "spend[ing] the afternoon, or the evening, fixing it..." Complex software is a different beast, and it's unrealistic to expect to be able to break all features down into simple obvious changes. There's just too much going on.
(You still can't handle just anyone committing just anything at any time, though. There will always be a rate of breakage introduction that your system can handle, and it's not hard to go over it.)
Wouldn't the testing basically consist of executing large amounts of SQL against an initially empty database, and then executing more SQL to read the current state of the data and verify it is what it should be?
Such tests could perhaps be database-agnostic to a degree, verifying that the database behaves according to the SQL standard?
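For illustration, that style of test is roughly the shape below, written against libpq (Postgres's C client library); the connection string, table, and expected count are all made up. It only checks end state, which is part of why it tends to miss concurrency, recovery, and scheduling bugs.

    /* Illustrative "execute SQL, then verify the state" test via libpq. */
    #include <stdio.h>
    #include <string.h>
    #include <libpq-fe.h>

    int main(void)
    {
        PGconn *conn = PQconnectdb("dbname=regression_test");
        if (PQstatus(conn) != CONNECTION_OK) {
            fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
            return 1;
        }

        PQclear(PQexec(conn, "CREATE TEMP TABLE t (n int)"));
        PQclear(PQexec(conn, "INSERT INTO t SELECT generate_series(1, 100)"));

        PGresult *res = PQexec(conn, "SELECT count(*) FROM t");
        int ok = PQresultStatus(res) == PGRES_TUPLES_OK
                 && strcmp(PQgetvalue(res, 0, 0), "100") == 0;

        printf("%s\n", ok ? "PASS" : "FAIL");
        PQclear(res);
        PQfinish(conn);
        return ok ? 0 : 1;
    }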
Those are the easiest tests to write, and probably pretty unlikely to find anything unless the feature you're adding is exposed via SQL pretty directly.
I was thinking more like lots of concurrent operations, and backups/restores (again concurrent with other DB traffic), and replication, and incremental operations, and failover, and error handling in general. All while varying things that feed into scheduling, etc. Anything nondeterministic is good, though that's not all of it. (Annoyingly, that means that failures will quite often be intermittent, which is a whole can of worms of its own.)
I laughed at that, too, but the commenter, Greg Smith (disclosure: a former colleague), has been involved with Postgres for ages, and concludes that Rust would not actually be a magic bullet here.
Not a magic bullet, but there's a hell of a lot of difference between maintaining C and Rust code.
In Rust it's much easier to create robust and performant abstractions that are near impossible to misuse, eliminating tons of potential bugs right away.
In Rust you don't need 10 years of experience and a microscope trained on every line to ensure it doesn't introduce issues. In Rust trivial changes are indeed trivial, so you have more time to think about the actually difficult parts.
In Rust contributors don't need to learn custom implementations of collections, strings, and other basic primitives on every project.
And most of all, people actually want to learn and work with Rust, so your contributor pool is expanding, not shrinking.
Sorry for the confusion—I know it's weird but the alternative turns out to be even more confusing and we've never figured out how to square that circle!
And what does "hard" even mean? Hard for whom? Hard for the average person? For the average developer? Hard for an expert on a particular topic? Hard for someone who has practiced a lot, or for someone who hasn't?
I don't think that's fair. The PostgreSQL codebase includes a lot of stuff that isn't included in SQLite, where it's covered by third-party projects - I'd bet their quality is nothing like SQLite's or PostgreSQL's. Tom Lane was involved in these commits, and it sounds like some of this got by him. His comments have frequently been described as "complete technical manuals", so I think that speaks to the complexity.
https://news.ycombinator.com/item?id=18442941
To quote part of it:
> Oracle Database 12.2.
> It is close to 25 million lines of C code.
> What an unimaginable horror! You can't change a single line of code in the product without breaking 1000s of existing tests. Generations of programmers have worked on that code under difficult deadlines and filled the code with all kinds of crap.
> Very complex pieces of logic, memory management, context switching, etc. are all held together with thousands of flags. The whole code is riddled with mysterious macros that one cannot decipher without picking up a notebook and expanding relevant parts of the macros by hand. It can take a day or two to really understand what a macro does.
> Sometimes one needs to understand the values and the effects of 20 different flags to predict how the code would behave in different situations. Sometimes 100s too! I am not exaggerating.
> The only reason why this product is still surviving and still works is due to literally millions of tests!