We’ve been seeing the same trend. Lots of teams moving to Hetzner for the price/performance, but then realizing they have to rebuild all the Postgres ops pieces (backups, failover, monitoring, etc.).
We ended up building a managed Postgres that runs directly on Hetzner. Same setup, but with HA, backups, and PITR handled for you. It’s open-source, runs close to the metal, and avoids the egress/I/O gotchas you get on AWS.
If anyone’s curious, here are some notes on our approach [1], [2]. Always happy to talk about it if you have any questions.
This is one of the key draws of Big Cloud, and especially of PaaS and managed SQL, for me (and the dev teams I advise).
Not having an ops background, I am nervous about:
* database backup+restore
* applying security patches on time (at OS and runtime levels)
* other security issues like making sure access to prod machines is restricted correctly, access is logged, ports are locked down, abnormal access patterns are detected
* DoS and similar protections (ideally these aren't my responsibility at all)
It feels like picking a popular cloud provider gives a lot of cover for these things - sometimes technically, and otherwise at least politically...
Applying security patches on time is not much of a problem. The ones you need to apply ASAP are rare, and you never put the DB engine on public access anyway; most of the time the exploit isn't publicly disclosed, and PoC code for a patched RCE isn't available on the day the patch is released.
Most of the time you're fine if you follow version updates as major releases come out, do regression testing, and put them on prod in your own planned time.
Most problems come from not updating at all and running 2- or 3-year-old versions, because that's what automated scanners look for, and after that much time it's much more likely someone has written and shared exploit code.
There must be SaaS services offering managed databases on different providers: you buy the servers, they install the software and host the backups for you. Anyone got any tips?
to be fair, AWS' database restore support is generally only a small part of the picture - the only option available is to spin an entirely new DB cluster up from the backup, so if your data recovery strategy isn't "roll back all data to before the incident", you have to build out all your own functionality for merging the backup and live data...
Yeah, and that default strategy tends to become very, very painful the first time you encounter non-trivial database corruption.
For example, one of my employers routinely tested DB restore by wiping an entire table in stage, and then having the on call restore from backup. This is trivial because you know it happened recently, you have low traffic in this instance, and you can cleanly copy over the missing table.
But the last actual production DB incident they had was a subtle data corruption bug that went unnoticed for several weeks - at which point restoring meant a painful merge of 10s of thousands of records, involving several related tables.
For sure. It's more about having a pipeline for pulling data from multiple sources - rather than spin up a whole new DB cluster, you usually want to pull the data into new tables in your existing DB, so that you can run queries across old & new data simultaneously
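To make that concrete, here's a rough sketch of the kind of pipeline I mean, assuming the backup has already been restored to a separate recovery instance and that postgres_fdw is available (the hostnames, the `orders` table, and the credentials are all made up):

```python
# Hypothetical sketch: expose a table from the restored copy inside the live
# database via postgres_fdw, so pre-incident and current data can be queried
# side by side. Hostnames, credentials, and the "orders" table are made up.
import psycopg2

LIVE_DSN = "host=live-db.internal dbname=app user=admin"

SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER IF NOT EXISTS restore_src
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'restored-db.internal', dbname 'app');

CREATE USER MAPPING IF NOT EXISTS FOR CURRENT_USER
    SERVER restore_src OPTIONS (user 'admin', password 'secret');

CREATE SCHEMA IF NOT EXISTS restored;

-- Pull the pre-incident copy of the table in next to the live one.
IMPORT FOREIGN SCHEMA public LIMIT TO (orders)
    FROM SERVER restore_src INTO restored;
"""

# Rows that changed (or disappeared) between the backup and the live data.
DIFF_SQL = """
SELECT r.id, r.total AS old_total, l.total AS live_total
FROM restored.orders AS r
LEFT JOIN public.orders AS l USING (id)
WHERE l.id IS NULL OR l.total IS DISTINCT FROM r.total;
"""

with psycopg2.connect(LIVE_DSN) as conn, conn.cursor() as cur:
    cur.execute(SETUP_SQL)
    cur.execute(DIFF_SQL)
    for row in cur.fetchall():
        print(row)
```

From there you can review the diff and write targeted INSERT ... SELECT / UPDATE statements against the live tables, instead of rolling the whole database back.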
Exactly this. For a small team that's focused on feature development and customer retention, I tend to gladly outsource this stuff and sleep easy at night. It's not even a cost or performance issue for me. It's about if I start focusing on this stuff, what about my actual business am I neglecting. It's a tradeoff.
I can attest to that. At Cloud 66 a lot of customers tell us that while the PaaS experience on Hetzner is great, they benefit from our managed DBs the most.
While I'm sure it's a great project, a few issues in the README gave me pause about how well it's kept up to date. Around half of the links in the list of dependencies are either out of date or just plain don't work, and it references Vagrant with no mention of Docker.
It's indeed undermaintained, so it's not a case of plug-and-play and automated pulls for production. Still, it's a solid base to build from when setting up on VMs or dedicated hardware, and I'm yet to find something better short of DIYing everything.
If you are looking for Postgres on Hetzner, you may want to check out Ubicloud.
We host in various bare metal providers, including Hetzner. (I am the lead engineer building Ubicloud PostgreSQL, so if you have questions I can answer them)
Yes, that is correct. That said, in our tests we only saw 2x improvements in CH benchmarks. However, we found out that this was due to an architectural issue in our VM I/O path and in how we virtualize storage. Based on our estimates we should see a ~5x difference, but for that we first need to revamp our storage virtualization.
We plan to publish CH benchmark results in a follow-up blog post. However, we didn't want to do that yet, to avoid putting out misleading results.
Really appreciated the author's persistence in sticking with PostgreSQL. There are many specialized solutions out there, but in the end they usually lack PostgreSQL's versatility and battle-testedness.
I read something similar in Yuval Harari's Sapiens, where he suggests wheat domesticated humans, not the other way around. An excerpt can be found here [1]. The whole essay is great, but I especially liked this part:
> The word “domesticate” comes from the Latin domus, which means “house.” Who’s the one living in a house? Not the wheat. It’s the Sapiens.
I don't remember the details of their arguments, but Graeber and Wengrow think this is a misleading image. IIRC one of their main thrusts was that over long periods of history, groups of humans have adopted and abandoned stationary agriculture at will, as conditions indicate.
I suppose that makes us about as domesticated as, say, lions or chimpanzees, which have been known to share food with humans ("work for them") in the wild, but it's not their reason for existence.
I lent out my copy of The Dawn of Everything, so I can't get exact quotes or pages, but this reminded me of a point in the book (which I highly recommend) that I'll attempt to summarize:
Domestication of plants was "easy" when tested in a controlled setting, carefully selecting seeds at a university. They estimate that wheat in the agricultural "revolution" (a much-scoffed-at term in the book) could have been domesticated in about 200 years if done purposefully. Instead, agriculture took something like 3,000 years to become dominant over mixed food sources (mostly gathering, fishing, and hunting, with some low-effort planting on riverbanks).
And yes, to your point, the idea that there is some sort of progression in human societies is contradicted by recent decades of archeological evidence: every arrangement you can imagine seems to have been tried (stationary hunter/gatherers, nomadic farmers, alternating back and forth, shifts toward farming for hundreds of years and then back to fishing for thousands). Humans' time on Earth has been much longer than our recorded history, with more variety and less monotony than we usually assume.
Anyway I hope that inspires someone to pick up the book, it really is a good read.
>IIRC one of their main thrusts was that over long periods of history, groups of humans have adopted and abandoned stationary agriculture at will, as conditions indicate.
In general they still totally depend on it.
So this would be like saying dogs aren't domesticated, because some left their owners or bit them, or there are groups of stray dogs here and there.
What makes you say they still totally depended on it? I can easily imagine groups of humans having a period of settled agriculture for convenience rather than necessity.
My theory is that multicellular life itself was developed because viruses wanted a more effective way to travel. Humans are the pinnacle of virus transportation technology, and they've developed very successful behavioral override countermeasures against our pesky use of vaccines.
He also talked about this "reverse chain of command" in a recent talk at Peking University:
Humans evolved from worms. The human brain was originally a bunch of neurons centered around the worm's mouth to search for food. It's natural to think humans are still controlled by the stomach to this day (or the spinal cord, for that matter).
At the time of our investigation, we found a few articles suggesting that power caps could potentially cause hardware degradation, though I don't have the exact sources at hand. I see the child comment shared one example, and after some searching, I found a few more sources [1], [2].
That said, I'm not an electronics engineer, so my understanding might not be entirely accurate. It’s possible that the degradation was caused by power fluctuations rather than the power cap itself, or perhaps another factor was at play.
The power used by a computer isn't limited by giving it less voltage/current than it should have - if it was, the CPU would crash almost immediately. It's done by reducing the CPU's clock rate until the power it naturally consumes is less than the power limit.
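One way to see this in practice is to watch the per-core clocks while the machine is loaded; under a power cap the frequencies sag, not the supply voltage. A minimal sketch, assuming a Linux host that exposes cpufreq via sysfs:

```python
# Sample per-core clock speeds from the cpufreq sysfs interface. Under a power
# cap you'd expect these to sit well below the advertised boost clock while the
# machine is loaded, because the cap is enforced by clocking the CPU down,
# not by starving it of voltage.
import glob
import time

def core_freqs_mhz():
    paths = glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq")
    return [int(open(p).read()) / 1000 for p in paths]  # kHz -> MHz

for _ in range(5):
    freqs = core_freqs_mhz()
    print(f"min {min(freqs):.0f} MHz / avg {sum(freqs)/len(freqs):.0f} MHz / max {max(freqs):.0f} MHz")
    time.sleep(1)
```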
Yeah, this is generally a good practice. The silver lining is that our suffering helped uncover the underlying issue faster. :)
This isn’t part of the blog post, but we're also considering, in the future, getting the servers and keeping them idle, without actual customer workload, for about a month. This would be more expensive, but it could help identify potential issues without impacting our users. In our case, the crashes started three weeks after we deployed our first AX162 server, so we'd need at least a month (or maybe even longer) as a buffer period.
>The silver lining is that our suffering helped uncover the underlying issue faster.
Did you actually uncover the true root cause? Or did they finally uncap the power consumption without telling you, just as they neither confirmed nor denied having limited it?
The root cause was a problem with the motherboard, though the exact issue remains unknown to us. I suspect that a component on the motherboard may have been vulnerable to power limitations or fluctuations and that the newer-generation motherboards included additional protection against this. However, this is purely my speculation.
I don't believe they simply lifted a power cap (if there was one in the first place). I genuinely think the fix came after the motherboard replacements. We had 2 batches of motherboard replacements and after that, the issue disappeared.
If someone from Hetzner is here, maybe they can give extra information.
Were you able to identify the manufacturer and model/revision of the failing motherboards? That would be extremely helpful when shopping for second-hand servers.
Definitely interesting material. I've noticed, especially in the last few years, an increased interest in moving away from proprietary clouds/PaaS to K8s or even to bare metal, driven primarily by high prices and also by the desire for more control.
At Ubicloud, we are attacking the same problem, though from a different angle. We are building an open-source alternative to AWS. You can host it yourself or use our managed services (which are 3x-10x more affordable than comparable services). We've already built primitives such as VMs, PostgreSQL, private networking, and load balancers, and we're also working on K8s.
I have a question for the HN crowd: which primitives do you need to run your workloads? The OP's list consists of Postgres, Redis, Elasticsearch, Secret Manager, Logging/Monitoring, Ingress, and Service Mesh. I wonder whether this is representative of the typical requirements for running the HN crowd's workloads.
Quite simple: I want to submit a Docker image and have it accept HTTP requests at a certain domain, with easy horizontal/vertical scaling. I'm sure your Elastic Compute product is nice, but I don't want to set it up myself (let alone run k8s on it). Quite like fly.io.
PS: I like what you guys are doing, I'd subscribe to your mailing list if you had one! :)
Sure you can, but Let's Encrypt, just like DigiCert, is a 3rd-party provider, and they don't guarantee that you'll get a signed certificate within a few minutes. If they have an outage, it could take hours to get a certificate, and you wouldn't be able to provision any database servers during that time. In our previous gig at Microsoft, we had multiple DigiCert outages that blocked provisioning.
I personally, anecdotally, haven't had any problems with this in the last few years, and it doesn't seem like a big issue based on the information in the incident forum posts:
https://community.letsencrypt.org/c/incidents/16/l/top
Self signing probably causes quite a few other issues, even though you have more control of the process, doesn't it?
I cannot comment on Let's Encrypt's reliability. Maybe I've just had too many bad experiences with DigiCert outages and I'm a bit pessimistic. However, their status page does not inspire much confidence: https://letsencrypt.status.io/pages/history/55957a99e800baa4...
I think if you only need to generate a certificate once in a while, using Let's Encrypt or DigiCert is OK. Even if they are down, you can wait a few hours. If you need to generate a certificate every few minutes, a few hours of downtime means hundreds of failed provisionings. Hence, we opted for self-signing.
In terms of reliability, it is great, because we control everything. It is also quite fast; it takes a few seconds to generate and sign a certificate. The biggest drawback is that you need to distribute the CA certificate as well. Historically, this was fine, because you need to pass the CA cert to the PostgreSQL client as a parameter anyway, so the additional friction we introduced for users with CA cert distribution was low. However, with PostgreSQL 16's libpq there is now an sslrootcert=system option, which automatically uses the OS's trusted root CA certificates. That makes the alternative much more seamless, requiring almost no action from the user, which tilts the balance in favor of globally trusted CAs, but it still doesn't give me enough reason to switch.
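For what it's worth, this is roughly what the two trust models look like from the client side (a sketch using psycopg2; the hostname and paths are made up, and sslrootcert=system needs a version 16+ libpq on the client):

```python
# Sketch of the two client-side trust models discussed above, via psycopg2
# (any libpq-based client behaves the same). Hostname and paths are made up.
import psycopg2

# Private CA: the CA certificate has to be distributed to clients out of band.
conn = psycopg2.connect(
    host="db-1234.example.com",
    dbname="app",
    user="app",
    sslmode="verify-full",
    sslrootcert="/etc/app/provider-ca.pem",  # shipped to the client separately
)

# Publicly trusted CA (e.g. Let's Encrypt) with a v16+ libpq on the client:
# trust whatever root store the OS already has, no extra file to distribute.
conn = psycopg2.connect(
    host="db-1234.example.com",
    dbname="app",
    user="app",
    sslmode="verify-full",
    sslrootcert="system",
)
```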
I have a few ideas around self-signing a cert and requesting a certificate from Let's Encrypt at the same time. The database could start with the self-signed certificate and switch to the Let's Encrypt certificate once it's ready. Maybe I'll implement something like that in the future.
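The switch-over itself should be cheap: since PostgreSQL 10 the SSL files are re-read on a config reload, so something along these lines (a hypothetical sketch; paths are made up, and the role needs permission to run ALTER SYSTEM) would swap the certificate without a restart:

```python
# Hypothetical switch-over: start serving with the bootstrap self-signed cert,
# then point PostgreSQL at the Let's Encrypt files once issuance succeeds.
# Paths are made up; the role needs permission to run ALTER SYSTEM.
import psycopg2

conn = psycopg2.connect("host=db-1234.example.com dbname=postgres user=postgres")
conn.autocommit = True  # ALTER SYSTEM can't run inside a transaction block
with conn.cursor() as cur:
    cur.execute("ALTER SYSTEM SET ssl_cert_file = '/etc/postgresql/tls/letsencrypt.crt'")
    cur.execute("ALTER SYSTEM SET ssl_key_file = '/etc/postgresql/tls/letsencrypt.key'")
    cur.execute("SELECT pg_reload_conf()")  # SSL files are re-read on reload (PG 10+)
conn.close()
```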
In another thread on this page I wrote more about this, but in summary: we also like k8s-based managed Postgres solutions. They are quite useful if you are running Postgres for yourself. In managed Postgres services offered by hyperscalers or companies like Crunchy, though, k8s is not commonly used.
In k8s, isolation is at the container level, so properly isolating system calls (for security purposes) is quite difficult. This wouldn't be a concern if you are running Postgres for yourself.
Also for us, one reason was operational simplicity. You can write a control plane for managed Postgres in 20K lines of code, including unit tests. This way, if anything breaks at scale, you can quickly figure out the issue without having to dive into dependencies.
[1] https://www.ubicloud.com/blog/difference-between-running-pos... [2] https://www.ubicloud.com/use-cases/postgresql