Why would you expect two different virtual machines to have identical performance?
I would expect that just the cache usage characteristics of "neighbouring" workloads alone would account for at least a 10% variance! Not to mention system bus usage, page table entry churn, etc, etc...
If you need more than 5% accuracy for a benchmark, you absolutely have to use dedicated hosts. Even then, just the temperature of the room would have an effect if you leave Turbo Boost enabled! Not to mention the "silicon lottery" that all overclockers are familiar with...
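For what it's worth, the turbo part is easy to pin down on Linux: with the intel_pstate driver the toggle lives in sysfs (other frequency drivers expose a cpufreq "boost" file instead). A minimal sketch, assuming that setup and root:

```python
# Check and disable Turbo Boost before benchmarking, assuming Linux + intel_pstate.
# Other frequency drivers expose /sys/devices/system/cpu/cpufreq/boost instead.
from pathlib import Path

NO_TURBO = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")

def turbo_enabled() -> bool:
    # "0" means turbo is allowed, "1" means it is disabled.
    return NO_TURBO.read_text().strip() == "0"

def disable_turbo() -> None:
    # Needs root; the setting persists until reboot.
    NO_TURBO.write_text("1")

if __name__ == "__main__":
    if turbo_enabled():
        print("Turbo Boost is on; clock speed will drift with temperature.")
        disable_turbo()
```

The silicon lottery, of course, has no sysfs knob.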
This feels like those engineering classes where we had to calculate stresses in every truss of a bridge to seven figures, and then multiply by ten for safety.
I didn't expect identical performance, but a 10-20% variance is just too big. For example, if https://www.cockroachlabs.com/guides/2021-cloud-report/ got a "slow" GCP virtual machine but a "fast" Azure virtual machine, the final result could totally flip.
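To make the "flip" concrete with made-up round numbers: a genuine 5% gap between providers is smaller than the instance-to-instance noise, so it can easily invert.

```python
# Toy numbers purely for illustration, not real measurements.
true_gcp, true_azure = 100_000, 95_000   # hypothetical "true" ops/sec per provider

slow_gcp = true_gcp * 0.90               # landed on a "slow" GCP instance
fast_azure = true_azure * 1.10           # landed on a "fast" Azure instance

print(slow_gcp, fast_azure)              # 90000.0 vs 104500.0 -- the ranking flips
```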
The more problematic scenario, as mentioned in the article, is when you need to do some sort of performance tuning that can take weeks/months to complete. On the cloud, you either have to keep the virtual machine running the whole time (and hope that a live migration doesn't happen behind the scenes to move it to a different physical host), or do the painful stop/start dance until you get back the "right" virtual machine before proceeding with the actual work.
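That stop/start ritual is basically a rejection-sampling loop. A sketch of it, with launch/benchmark/terminate as hypothetical stand-ins for your provider's API and a short canary run (e.g. a one-minute sysbench):

```python
from typing import Any, Callable

def find_acceptable_instance(
    launch: Callable[[], Any],          # hypothetical: boots a fresh VM, returns a handle
    benchmark: Callable[[Any], float],  # hypothetical: short canary run, returns a score
    terminate: Callable[[Any], None],   # hypothetical: gives the slow VM back
    target_score: float,
    max_attempts: int = 10,
) -> Any:
    """Relaunch until a VM clears the target score, mirroring the manual stop/start dance."""
    for _ in range(max_attempts):
        vm = launch()
        if benchmark(vm) >= target_score:
            return vm                   # keep this one for the weeks-long tuning work
        terminate(vm)
    raise RuntimeError(f"no instance met {target_score} after {max_attempts} attempts")
```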
We discovered this variance a couple of months ago, and this article from talawah.io is actually the first time I have seen anyone else mention it. It still remains a mystery, because we too can't figure out what contributes to the variance using tools like stress-ng, but the variance is real when looking at the MySQL commits/s metric.
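What we could do was at least quantify the spread. A small helper, assuming you've already collected commits/s samples from the same workload on several "identical" instances (the numbers below are placeholders, not our data):

```python
import statistics

def spread(samples: list[float]) -> tuple[float, float, float]:
    """Mean, standard deviation and coefficient of variation of benchmark samples."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return mean, stdev, stdev / mean   # CV above ~0.05 means >5% instance-to-instance noise

# Placeholder values purely for illustration:
print(spread([4100.0, 3550.0, 4020.0, 3480.0, 3900.0]))
```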
> If you need more than 5% accuracy for a benchmark, you absolutely have to use dedicated hosts.
After this ordeal, I am arriving at that conclusion as well. Just the perfect excuse to build a couple of Ryzen boxes.
This is a bit like someone being mystified that their arrival time at a destination across the city is not repeatable to within plus-minus a minute.
There are traffic lights on the way! Other cars! Weather! Etc...
I've heard that Google's internal servers (not GCP!) use special features of the Intel Xeon processors to logically partition the CPU caches. This enables non-prod workloads to coexist with prod workloads with minimal risk of cache thrashing of the prod workload. IBM mainframes go further, splitting at the hardware level, with dedicated expansion slots and the like.
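If that's Intel's Cache Allocation Technology (CAT), it isn't exclusive to Google: Linux exposes it through the resctrl filesystem on CAT-capable Xeons. A rough sketch, assuming root, a single L3 cache domain, and resctrl already mounted at /sys/fs/resctrl (mount -t resctrl resctrl /sys/fs/resctrl); the mask and group name are illustrative:

```python
from pathlib import Path

RESCTRL = Path("/sys/fs/resctrl")

def confine_to_cache_ways(group: str, pid: int, l3_mask: str = "0f") -> None:
    """Put `pid` in a resctrl group limited to the L3 ways selected by `l3_mask`."""
    grp = RESCTRL / group
    grp.mkdir(exist_ok=True)
    (grp / "schemata").write_text(f"L3:0={l3_mask}\n")  # allowed ways on cache domain 0
    (grp / "tasks").write_text(str(pid))                # move the task into the group

# e.g. confine_to_cache_ways("batch", noisy_neighbour_pid) to fence off a batch job
```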
You can't reasonably expect 4-core virtual machines to behave identically to within 5% on a shared platform! That tiny little VM is probably shoulder-to-shoulder with 6 or 7 other tenants on a 28 or 32 core processor. The host itself is likely dual-socket, and some other VM sizes may be present, so up to 60 other VMs could be running on the same host. All sharing memory, network, disk, etc...
The original article was also a network test. Shared fabrics aren't going to return 100% consistent results either. For that, you'd need a crossover cable.
Well, I'll be the first one to admit that I was naive to expect <5% variance prior to this experience. But I guess you are going too far by framing this as common wisdom?
Of course, both the CockroachDB and MongoDB cases could be related, as any performance variance at the instance level could be masked when the instances form a cluster and the workload can be served by any node within the cluster.
You do have a point. I have also seen many benchmarks use cloud instances without any disclaimers, and it always made me raise an eyebrow.
Any such benchmark I do is averaged over a few instances in several availability zones. I also benchmark specifically in the local region that I will be deploying production to. They're not all the same!
Where the cloud is useful for benchmarking is that it's possible to spin up a wide range of "scenarios" at low cost. Want to run a series of tests ranging from 1 to 100 cores in a single box? You can! That's very useful for many kinds of multi-threaded development.
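For the core-count sweep you don't even need separate instance sizes; one big box plus CPU affinity gets you most of the way. A sketch, where workload() is a hypothetical stand-in for whatever you're measuring:

```python
import os
import time

def sweep(workload, max_cores: int) -> dict[int, float]:
    """Time the same workload while restricting the process to 1..max_cores CPUs (Linux)."""
    results = {}
    for n in range(1, max_cores + 1):
        os.sched_setaffinity(0, set(range(n)))   # restrict to cores 0..n-1; threads started after this inherit it
        start = time.perf_counter()
        workload()
        results[n] = time.perf_counter() - start
    os.sched_setaffinity(0, set(range(os.cpu_count())))  # restore full affinity
    return results
```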