Sorry, but your testing methodology needs some help. Were you testing OSes? Postgres? Storage configuration? I suggest starting with FIO: use it to isolate the best-performing disk configuration (storage + kernel + filesystem + whatever), then run some pgbench tests against that configuration with different PostgreSQL tuning parameters, to show the best way to tune Postgres.
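For example, a minimal sketch of that workflow (device paths, sizes and client counts are illustrative, not a recommendation) could look like this on Linux:

# fio --name=pgsim --filename=/data/fio.test --size=8G --bs=8k \
      --rw=randrw --rwmixread=70 --ioengine=libaio --direct=1 \
      --iodepth=32 --runtime=300 --time_based --group_reporting
# pgbench -i -s 2000 bench
# pgbench -c 64 -j 8 -T 600 bench

Run the fio step against each candidate storage layout, keep the winner, then vary only the PostgreSQL parameters across the pgbench runs.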
A few thoughts:
* You weren't testing OSes, which the subject implied; you were testing Linux kernel variants and their stock OS configurations/kernel scheduler setups, with FreeBSD tossed into the mix. Whether you are running Ubuntu, CentOS, Debian or whatever, you should have the Linux kernel tuned to perform well, so adding the distribution as a variable is just a red herring. I'd be more interested in removing that variable and comparing different storage configurations (such as XFS and LVM).
* Clients connecting over the network adds a huge variable (the network) into play -- ideally you would want to remove this.
* I may have missed it, but it wasn't clear if you had a warmup period in your benchmarks. Especially with a copy-on-write system like ZFS, you need to run a few benchmarks over the same blocks first to break past the cache.
He did adequately disclose that the OS was swapped out while pgsql and its settings were held steady. Adding more arbitrary benchmarks like FIO won't really clarify pgsql performance, if that is the desired investigation. Instead, active benchmarking, where you identify the bottlenecks in each tested platform, would be a better use of time, and you could hypothesize how to improve each platform if you were to pick it.
As a counter-example, I could easily cherry-pick versions, tunables and patch sets to make the numbers go whichever way I want, so these types of comparisons aren't that useful unless one platform is dropping a big delta on the floor with out-of-the-box settings versus another.
As a FreeBSD developer, I will actually tell you that Linux could be made to graph massive wins by cherry-picking hardware with a very high core count and several NUMA domains. But even then, by selecting kernel features (which could be innocuously hidden in a version number/vendor patch set) you can cherry-pick large swings. That said, FreeBSD+ZFS+PGSQL (https://www.slideshare.net/SeanChittenden/postgresql-zfs-bes...) is a joy to administer, and is unlikely to be the weak link in a production setup if you stick to a two-socket system. There is a lot of work going on in HEAD that is relevant to this workload in the memory management subsystem, including NUMA support, as well as some TCP accept locking changes that would be relevant for TCP connection turnover.
People benchmark CentOS too much. A far more interesting test would be Oracle Linux, both with and without the UEK.
I would bet this would wipe out the SUSE advantage:
# cat /proc/version
Linux version 4.1.12-112.14.1.el7uek.x86_64 (mockbuild@) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #2 SMP Fri Dec 8 18:37:23 PST 2017
The "RedHat-compatible kernel" should have identical performance to CentOS.
# rpm -qa | grep ^kernel | sort
kernel-3.10.0-693.11.1.el7.x86_64
...
kernel-uek-4.1.12-112.14.1.el7uek.x86_64
...
I realize that many people don't like Oracle Linux due to their unauthorized appropriation and support of their RedHat clone. It does, however, bring new functionality to the table (primarily Ksplice), and it has great support for the eponymous database.
When RedHat 9 support ended, it never touched my personal systems again. Even my CentOS habit has been quelled.
Hello, OP here. I'm certain that you can fine-tune every OS for a specific use case. I may indeed do that in a future blog post. The question is what to compare? Should I compare Linux kernel versions, PostgreSQL versions, filesystems (and features like compression, block size, ...)? As you can see, the permutations are endless, and that's why I compared stock OSes with their default filesystems of choice.
I don't think that a Linux distribution is just a variable where the only thing that differs is the kernel version. Each distro made its own choices, for better or worse...
As for the clients connecting over the network - that was exactly my point. My idea was to benchmark in conditions similar to a production deployment. I doubt that many production systems connect over a Unix socket.
As for the warmup period: as you can see in the benchmarking script, there is a 30 min warmup period before I start to record the results.
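(For readers who haven't opened the script: the general shape is a throwaway run followed by the measured one, roughly like the sketch below; the exact scale and client counts live in the script, these are placeholders.)

# pgbench -c 64 -j 8 -T 1800 bench > /dev/null     # 30 min warmup, output discarded
# pgbench -c 64 -j 8 -T 3600 bench | tee results.txt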
> testing Linux kernel variants and their stock OS configurations/kernel scheduler setups, and FreeBSD was tossed into the mix
For a lot of people who don't have the time or understanding to play with things much beyond stock versions and configuration, this could still be a useful benchmark. People who have the knowledge, confidence, and time probably won't be reading the article at all, as they'll have already performed their own less artificial tests (i.e. benchmarking with their own application using live-like data and load patterns).
Though that is my argument against any benchmark like this: it is at best an indicator of peak activity under very specific conditions - a starting point, but one that doesn't really represent my application with any precision or accuracy.
> it wasn't clear if you had a warmup period to your benchmarks
I think it's pretty fair to test each OS with its filesystem of choice. I'm aware that you can use ZFS on Linux, but I'm not (yet) brave enough to recommend ZFS+Linux. And yes, there's btrfs, but would you trust it with your data? :)
I kind of miss network traffic diagrams; just mentioning that it's connecting over Gbit isn't enough for me. Is there 200 Mbit of SQL traffic going back and forth, or just 8 Mbit?
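Even a coarse sample of the interface counters taken during a run would answer that, e.g. (assuming sysstat is installed; interval and duration are illustrative):

# sar -n DEV 10 60     # sample all interfaces every 10s for 10 minutes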
The glaring difference will probably be a ZFS thing. The default ZFS recordsize is 128K, whereas databases usually use 8K pages. I didn't read closely enough to see what the story is with async writes, or the ZIL, or whatever. tl;dr - there are huge gains to be had with ZFS with a little tuning.
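For what it's worth, the usual starting point looks something like this (pool/dataset names are made up):

# zfs set recordsize=8K tank/pgdata
# zfs get recordsize,compression,primarycache tank/pgdata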
I think the BSDs have always left it to the administrator to optimize the system to their needs, and it didn’t look like OP did any optimization for the first set of results.
You might also try setting primarycache=metadata, to avoid double caching the data. From what I have been told, this can help reduce memory pressure of the ARC competing with postgres' own caching.
Are you also setting a max limit for the ARC? You don't want postgres and the ZFS ARC to compete for memory. I wonder if this impacts FreeBSD's poor performance in the read-intensive tests.
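On FreeBSD, that combination would look roughly like this (the dataset name and the 8 GB cap are only examples, size it to your box):

# zfs set primarycache=metadata tank/pgdata
# echo 'vfs.zfs.arc_max="8G"' >> /boot/loader.conf     # applied at next boot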
I don’t think it’s Linux software RAID overhead but rather ZFS’s on-the-fly compression/decompression, which could mean the BSDs are doing less IO than their Linux counterparts. I don’t know for sure; it’s just a hunch.
I've done plenty of benchmarks with LZ4 and GZIP compression. It really helps a lot; LZ4 in particular should always be enabled, as it is always faster. GZIP is fantastic for some datasets where you do sequential scans of many tables in the 10 MB - 100 MB range, but it should be used with care.
The great thing about PostgreSQL is tablespaces: you can put tables and indexes on different ZFS filesystems, with different hardware (disk vs SSD), recordsizes and compression options.
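A rough sketch of that layout (pool name, dataset names and table are made up; the tablespace directory must exist and be owned by the postgres user):

# zfs create -p -o recordsize=8K -o compression=lz4 tank/pg/tables
# zfs create -p -o recordsize=128K -o compression=gzip tank/pg/archive
# psql -c "CREATE TABLESPACE archive LOCATION '/tank/pg/archive';"
# psql -c "ALTER TABLE history SET TABLESPACE archive;"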
Before the benchmark I did some tests with FreeBSD+ZFS with LZ4 on/off. The LZ4 CPU overhead was pretty negligible and performance was slightly better than without compression, due to lower IO usage. That's the reason I chose LZ4 for the actual benchmark.
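(If anyone wants to check the effect on their own data, the achieved ratio is visible directly on the dataset; name illustrative:)

# zfs get compression,compressratio tank/pgdata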
The benchmark is a bit arbitrary; there are just too many variables: filesystems (ZFS/ext4), RAID (ZFS RAID vs Linux software RAID), compression (ZFS LZ4 vs raw), schedulers (CFQ/deadline on Linux, ULE on FreeBSD). Also, it is not clear if the PostgreSQL dataset includes both the data and the transaction logs (WAL).
Wow! The difference in RAID between FreeBSD and the GNU/Linux flavors is more than a little bit shocking on the read/write tests. I would not have expected it to be that significant.
I wondered the same thing. The best explanation had to do with the 32 GB dataset being on the system with 32 GB of RAM vs the 200 GB dataset on the system with 256 GB of RAM. Makes sense, but it isn’t clear from the article’s perspective.
Probably different postgresql.conf settings. For example, the 32 GiB instance has "effective_cache_size = 24GB", while the 200 GiB instance has "effective_cache_size = 144GB".
The filesystem of choice for a database workload on CentOS 7 should be XFS, because it's one of the defaults (especially in public clouds) and because it can be grown online. I suggest you add it.
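For reference, growing a mounted XFS filesystem is a single command once the underlying device or LV has been enlarged (mount point is illustrative):

# xfs_growfs /var/lib/pgsql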
I wasn’t expecting such a drastic difference between FreeBSD and the Linux distros. I wonder if this is a general BSD thing, or if it was FreeBSD-specific (or the filesystem or drivers).
Note that FreeBSD in the benchmark is running with ZFS LZ4 compression enabled, which works like a speed miracle for read-only database tests. If ZFS compression doubles the effective read bandwidth of the disks, it will nearly double the database benchmark speed.
On Linux one would need to benchmark Btrfs with LZO compression to compete against FreeBSD's ZFS with compression.
That's specifically incorrect: the results showed that the compressed ZFS was slower on pgbench and faster on TPC-B.
It's also a generically unwise assumption, because you have to know how fast the compression is compared to disk I/O. If, like almost all servers these days, you have more CPU than I/O capacity, the compression overhead will often be buried in the I/O latency, and if the data compresses well it's easy for it to be faster, because the I/O savings are greater than the CPU cost.
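To put rough, purely illustrative numbers on it: with a 2:1 compressible dataset on an array that reads 500 MB/s, the disks deliver an effective 1 GB/s of logical data, while a single modern core decompresses LZ4 at well over 1 GB/s, so the extra CPU work hides entirely inside the I/O wait.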
Since LZ4 was designed to be very fast, that seems like a reasonable bar to hit — I see single-core performance on old desktops in the 1.8+GB/s range.
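That's easy to sanity-check with the lz4 CLI's built-in benchmark mode on any reasonably compressible file you have lying around (assuming the lz4 command-line tool is installed; the file path is just an example):

# lz4 -b1 /var/log/messages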
I find it hard to believe that many would use software RAID with Postgres in production, especially under the load conditions tested here. I have seen little hope for performant software RAID on Linux in the past twenty years. It would be interesting to see the same tests run using stable hardware RAID of some sort (preferably with SSDs).
If you want the ability to scrub (in a way that detects errors across the whole storage stack), you must sacrifice hardware RAID. You must also pay the fletcher/sha256 checksumming tax.
ZFS enthusiasts advocate that critical data absolutely requires those sacrifices. I'm inclined to agree.
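In exchange you get end-to-end verification of every block on demand (pool name illustrative):

# zpool scrub tank
# zpool status -v tank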
I suppose it's not worth reading an article that compares benchmarks of Ubuntu and CentOS. It's just bullshit. Be honest and compare comparable things.