Whenever I look at large aggregation benchmarks like this, I try to estimate cycles/value or, better, cycles/byte.
Take this query:
SELECT cab_type,
       count(*)
FROM trips
GROUP BY cab_type;
This is just counting occurrences of distinct values from a bag of total values sized @ 1.1B.
He's got 8 cores @ 2.7GHz, which presumably can clock up for short bursts at least a bit even when they're all running all out. Let's say 3B cycles/core/second. So in .134 seconds (the best measured time) he's burning ~3.2B cycles to aggregate 1.1B values, or about 3 cycles/value.
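As a quick sanity check on that arithmetic (a throwaway snippet, nothing from the article):

    // Back-of-envelope: 8 cores * 3 GHz * 0.134 s spread over 1.1B values.
    #include <cstdio>
    int main() {
        double cycles = 8 * 3e9 * 0.134;                      // ~3.2B cycles burned
        std::printf("%.2f cycles/value\n", cycles / 1.1e9);   // ~2.92
    }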
While that's ridiculously efficient for a traditional row-oriented database, for a columnar scheme as I'm sure OmniSciDB is using, it's less efficient than I might have expected.
Presumably the # of distinct cab types is relatively small, and you could dictionary-encode all possible values in a byte at worst. I'd expect opportunities both for computationally friendly compact encoding ("yellow" is presumably a dominant outlier and could make RLE quite profitable) and SIMD data parallel approaches that should let you roll through 4,8,16 values in a cycle or two.
Even adding LZ4 should only cost you about a cycle a byte.
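To make the dictionary-encoded counting idea concrete, here's a minimal sketch (an illustration only, assuming cab_type is already encoded down to one byte per row; not OmniSciDB's actual code). Note that a plain indexed-increment loop like this doesn't auto-vectorize well because of index conflicts, which is why hand-written SIMD versions typically keep per-lane sub-counts:

    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Count occurrences of each dictionary id, assuming at most 256 distinct values
    // (one byte per row after dictionary encoding).
    std::array<uint64_t, 256> count_ids(const std::vector<uint8_t>& ids) {
        std::array<uint64_t, 256> counts{};
        for (uint8_t id : ids) ++counts[id];
        return counts;
    }

    int main() {
        std::vector<uint8_t> ids = {1, 1, 0, 1, 1, 1, 0, 1};  // e.g. 0 = green, 1 = yellow
        auto counts = count_ids(ids);
        std::printf("green=%llu yellow=%llu\n",
                    (unsigned long long)counts[0], (unsigned long long)counts[1]);
    }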
That's not to denigrate OmniSciDB: They're already several orders of magnitude better than traditional database solutions, and plumbing all the way down from high-level SQL to bit twiddling SIMD is no small feat. More that there's substantial headroom to make systems like this even faster, at least until you hit the memory bandwidth wall.
I think this is a good point. On GPUs, SIMT is effectively automatic vectorization, so our focus has been on the memory bandwidth wall (we make use of cuda shared memory in nvidia GPU mode for aggregates like the above query). Non-random access compression on GPUs also has been a nonstarter, at least historically. With more recent GPUs and more recent versions of CUDA, perhaps this is changing. But on CPUs, we have started looking into vectorization. There is a tradeoff, though -- the vectorization LLVM passes do add time to the compilation phase, and at subsecond query speeds that time isn't always worth it.
There are also a few other tricks to get closer to roofline performance. If you sort the input data on the key you're grouping by you can see small performance improvements, mostly from better cache locality. But, part of the "magic" of OmniSciDB is that you can group on any key and get good performance without ingesting, reindexing, etc.
I wonder what this query compiles to, in terms of a C-code equivalent?
SELECT cab_type, count(*)
FROM trips
GROUP BY cab_type;
From the execution time it seems to me that this is a straight sum() of 32-bit integers. "cab_type" has two distinct values, and if the stored 32-bit value for "green" is 0 and "yellow" is 1, a straight sum of those integers produces the desired counts (the sum is the yellow count, and green is the remainder) and explains the performance. That said, the same performance will not extend to a key that has three or more distinct values.
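If that's what's happening, a toy version of the trick (a guess at the shape, not taken from the article) would be:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // With only two groups encoded as 0 (green) and 1 (yellow), a plain sum
    // gives the yellow count; green is the remainder. This turns the GROUP BY
    // into a simple reduction that vectorizes trivially.
    int main() {
        std::vector<int32_t> cab_type = {1, 1, 0, 1, 1, 0, 1, 1};
        uint64_t yellow = 0;
        for (int32_t v : cab_type) yellow += v;
        uint64_t green = cab_type.size() - yellow;
        std::printf("yellow=%llu green=%llu\n",
                    (unsigned long long)yellow, (unsigned long long)green);
    }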
In a naive case it compiles to a loop over all the elements and a hash table probe for each element. Now the magic comes from a few observations (see the sketch after this list):
- cab_type has very few distinct values. so you can encode those values from 1 .. N and use an array of size N instead of the hash table
- you can build a “parallel scan”: split the rows evenly across many threads and each thread processes its own chunk
- the operation per row is very basic: you need to look up in the array and increment a value. so you can use SIMD to perform operations on multiple rows at the same time
- using some bit manipulation magic you can do the above on “encoded values”: you never need to convert the cab_type bit representation to an integer from 1..N
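Put together, a rough non-SIMD sketch of that plan (hypothetical code, just to make the shape concrete): each thread counts its own chunk into a small local array, and the per-thread partials are merged at the end.

    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    constexpr int kGroups = 4;  // assume few distinct values, densely encoded as 0..kGroups-1

    int main() {
        std::vector<uint8_t> keys(1'000'000, 1);   // stand-in for encoded cab_type
        unsigned nthreads = std::thread::hardware_concurrency();
        if (nthreads == 0) nthreads = 4;
        std::vector<std::array<uint64_t, kGroups>> partial(nthreads);  // one local count array per thread
        std::vector<std::thread> pool;
        size_t chunk = (keys.size() + nthreads - 1) / nthreads;
        for (unsigned t = 0; t < nthreads; ++t) {
            pool.emplace_back([&, t] {
                size_t begin = t * chunk, end = std::min(keys.size(), begin + chunk);
                for (size_t i = begin; i < end; ++i) ++partial[t][keys[i]];
            });
        }
        for (auto& th : pool) th.join();
        std::array<uint64_t, kGroups> total{};     // merge the partial counts
        for (const auto& p : partial)
            for (int g = 0; g < kGroups; ++g) total[g] += p[g];
        for (int g = 0; g < kGroups; ++g)
            std::printf("group %d: %llu\n", g, (unsigned long long)total[g]);
    }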
Thanks for the explanation, Nikita. Any branching, hashing, etc. will increase execution time. The benchmark was run on a laptop, which has 2 memory channels; 0.13s is about the absolute best this laptop can muster as far as memory throughput goes. Usually 4 threads are enough to saturate memory.
This is a straight sum of 64-bit values; using AVX2 it will sum 1B of them in 0.26s. Incrementing conditionally will not be as fast and will throw vectorization out the window too.
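For reference, a minimal AVX2 sum of 64-bit integers looks roughly like this (a generic sketch, not MemSQL's code; compile with -mavx2):

    #include <immintrin.h>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Sum 64-bit integers four at a time with AVX2; the remainder is handled scalar.
    int64_t avx2_sum(const int64_t* data, size_t n) {
        __m256i acc = _mm256_setzero_si256();
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m256i v = _mm256_loadu_si256((const __m256i*)(data + i));
            acc = _mm256_add_epi64(acc, v);
        }
        int64_t lanes[4];
        _mm256_storeu_si256((__m256i*)lanes, acc);
        int64_t total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
        for (; i < n; ++i) total += data[i];
        return total;
    }

    int main() {
        std::vector<int64_t> v(1000, 1);
        std::printf("%lld\n", (long long)avx2_sum(v.data(), v.size()));
    }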
And of course there is no branching in MemSQL for this use case. And also no hashing b/c number of groups is small and you can use an array and not a hashtable.
Finally, if you compress the data rather than doing the sum on an uncompressed array, you get a much more compact data representation, which lets you avoid hitting the memory bandwidth ceiling this quickly (with 4 threads).
When you say an array and not a hash table, do you just mean a simple perfect hash table indexed by the offset of the dictionary id? We use this fairly extensively for inputs of bounded domain (i.e. dictionary-encoded strings, moderately-sized integer ranges, even binned values, numeric or timestamp), but call it perfect hashing. I assume we're talking about the same thing but wanted to clarify.
I’m still of an opinion that it’s important to demonstrate performance on more complex queries with joins, subqueries, subselects, and clustered data movements. The count(*), group by query is a very very simple case.
As someone who has built a columnstore database and tested it on that query shape (group-by over a column with only a few distinct values), I can say it's possible to go faster than 1 cycle per row via SIMD and operating directly on compressed data (0.87 cycles per row). This blog post gives some details on how it's done (https://www.memsql.com/blog/how-to-process-trillion-rows-per...)
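The blog post covers the real techniques; as a much simpler illustration of why operating on compressed data helps here, a count over run-length-encoded values touches one (value, length) pair per run instead of one element per row (a hypothetical sketch, not MemSQL's code):

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    struct Run { uint8_t value; uint64_t length; };  // RLE: (dictionary id, run length)

    int main() {
        // 900M "yellow" rows followed by 200M "green" rows are just two runs.
        std::vector<Run> runs = {{1, 900'000'000ULL}, {0, 200'000'000ULL}};
        std::unordered_map<uint8_t, uint64_t> counts;
        for (const Run& r : runs) counts[r.value] += r.length;  // one add per run, not per row
        for (const auto& [value, count] : counts)
            std::printf("value %u: %llu rows\n", (unsigned)value, (unsigned long long)count);
    }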
But for the inferences you've made (e.g. the # of distinct cab types is relatively small) you need knowledge of the whole data. What if someone uses the wrong column name? Getting the correct summary of data quickly is easy -- if you already know the answer.
Something like `count(*)` needs to work well where you have no idea at all about the data.
What would it take to get it to be 1 cycle/value? Does 1 cycle here mean 1 instruction? If not, how many instructions does it allow? It's just a matter of copying (in user-space I'm guessing) memory around, right?
Modern superscalar CPUs with multi-layer caches, hyperthreading, vector extensions, ... have an incredibly complex, dynamic relationship between cycles and instructions executed.
For any particular implementation of a tight inner loop like this, you could measure IPC via internal counters and it would be quite consistent, but it’s really hard to estimate without them.
And does it matter? Ultimately you care about how many cores at how many GHz you need to get the job done.
For those wanting to try it for themselves, we recently released a preview of our full stack for Mac (containing both OmniSciDB as well as our Immerse frontend for interactive visual analytics), available for free here: https://www.omnisci.com/mac-preview. This is a bit of an experiment for us, so we'd love your feedback! Note that the Mac preview doesn't yet have the scalable rendering capabilities our platform is known for, but stay tuned.
You can also install the open source version of OmniSciDB, either via tar/deb/rpm/Docker for Linux (https://www.omnisci.com/platform/downloads/open-source) or by following the build instructions for Mac in our git repo: https://github.com/omnisci/omniscidb (hopefully will have standalone builds for Mac up soon). You can also run a Dockerized version on your Mac, but as a disclaimer the performance, particularly around storage access, lags a bare metal install.
I have zero experience authoring brew packages, but as a consumer, a ‘brew install omniscidb’ would be really nice. Especially with things like databases that you might build apps on top of, it's convenient to have a tool manage the installed versions, and it's also very nice to have a common interface for managing these things. I use brew for mysql, postgres, redis, qt, rbenv, node and more. Fitting into that ecosystem makes it easier for me to bring this in as a dependency, since it's just another brew dev dependency.
Unfortunately it's probably not in the cards in the near term, just due to other priorities and insufficient demand (plus alternatives like HIP for AMD). I will say a lot of us here at OmniSci would kill to leverage the latent GPUs in our Macs and other places, so we'd welcome any community help towards this end (it's not a trivial thing to add, but also not particularly difficult either, just work).
Agreed. I'm also a bit frustrated that he never tuned Vertica in his test of it - he just loaded the data into the default superprojection and queried it like that. Nearly all of Vertica's power comes from its concept of projections - just like normal DBs benefit from indexes. I'd really like to see how it performs against these other systems once it's been tuned properly, because in my experience it's always been the fastest DB I've ever worked with.
You may find the recent RAPIDS open source community GPU stack benchmark on TPCx-BB relevant, where they beat out the rest of the tools on something like 2x cheaper hardware while running ~44x faster. It's an interesting industry benchmark b/c it requires handling diverse use cases and out-of-core optimizations (e.g., 1TB going through a 16GB GPU, and a combination of matrices, tables, etc), so BlazingSQL or cuML alone wouldn't really cover it, but all of them together did.
(To make this link work, press the "100 mln." button when you arrive. It seems there is an encoding issue with this link that causes it not to work correctly when copied to some sites.)
I’m almost entirely sure Litwintschik is misinformed in regards to the GPUs in his laptop.
Yes, he does have the Intel GPU he mentioned, but if he paid $200 to upgrade the GPU as he claims, he would also have a dedicated AMD Radeon Pro 5500M 8GB.
Actually, I don't think it's possible to configure a 16" MBP without a discrete GPU - they all come with AMDs. Only the 13" MBPs can be configured without one, but he says he's using a 16".
His other benchmarks of OmniSciDB ran on systems with Nvidia GPUs, but I think he's pointing out that in this case the AMD GPU wasn't used by the DB engine even if it was part of the system config.
Just to clarify, most of the query engine is built around LLVM-based JIT compilation, and CUDA is not really used per se except for GPU-specific operators like atomic aggregates and thread synchronization, and of course we use the driver API to manage the GPUs, allocate memory, etc. Supporting AMD GPUs or the upcoming Intel Xe GPUs (or frankly anything that has an LLVM backend) would not be particularly hard, it would just require adding similar supporting infra.
He may have been going by "About this Mac", which will show the Intel GPU if you're not plugged into an external monitor, or using a program that activates the dedicated GPU.
I once contacted the author to check out my open source package and benchmark it and he mentioned that he actually charges for the benchmarking exercise. So yeah.
The implication is that this article was paid for by OmniSciDB or someone with an interest. This changes the presumed context and should be noted upfront by the author if that is the case.
Perhaps, though I suspect he didn't start his benchmark series that way. Once it got popular, he may have gotten more requests (for essentially free marketing) and needed a paygate.
His articles are extremely well documented, so there's nothing stopping anyone from performing the same benchmarks.
I don't know if I'd necessarily draw that implication -- I have no data; my broader point was more around the fact that consultants have a different way of saying "no" than most people.
People can elect to work on things voluntarily for personal reasons, but if someone asks them to fix or do something for them, then instead of saying no, they might say "how much will you pay me?" It's their time and they're under no obligation to offer it to you for free (though they can voluntarily choose to if they want).
That's a valid point. It's possible everything else was on his own time and it was just his way of saying no. I end up reading the "how much will you pay me?" as suggesting other things posted may be paid. That's not a fair read.
>The GPU won't be used by OmniSciDB in this benchmark but for the record it's an Intel UHD Graphics 630 with 1,536 MB of GPU RAM. This GPU was a $200 upgrade over the stock GPU Apple ships with this notebook. Nonetheless, it won't have a material impact on this benchmark.
He lost me here...I get that it doesn't matter, but come on, if you don't know that your computer has a GPU other than the integrated graphics (that you admit you paid more to upgrade) then what are you really doing...
Another comment suggested that he got this information from the "About This Mac" popup, which only shows the dedicated GPU when connected to an external display or when using an application that activates the dedicated GPU.
If he ran his benchmarks and then checked his hardware while writing this article then the author might've gotten confused by that.
Mark did a benchmark of SQLite using its internal file format a few years ago (https://tech.marksblogg.com/billion-nyc-taxi-rides-sqlite-pa...), clocking the import at 5.5 hours. It looks like this was done though on a spinning disk, so given a proper SSD, and a newer version of SQLite, it might be much faster.
Caveat: these benchmarks only test the simplest of operations like aggregation (GROUP BY, COUNT, AVG) and sorts (ORDER BY). No JOINs or window operations are performed. Even basic filtering (WHERE) doesn't seem to have been tested. YMMV.
No, but I think each piece of software is put in a proper context, to match what most would commonly use in that particular use case. For example, the Clickhouse benchmarks are run against typical modest cloud instances.
To be fair, the c5d.9xlarge instances are $1.728 each per hour, or $5.18 for the 3-server cluster (looks to be about $3.06/hr for reserved 1-year pricing). Even with reserved pricing, that's $26,806 a year, or 6.5X more than a $4K laptop that likely will last for years and would be bought anyway (or at least a cheaper variant, which would also run these queries nearly as quickly). Of course that's very apples-to-oranges, so another way to look at this is that OmniSci would probably see significantly better performance on a single c5d.9xlarge than what we saw on this Mac (would need to benchmark, but informally I can say that OmniSci was 2-3X faster running on CPU on my Linux workstation compared to my Mac).
Disclaimer: No disrespect to ClickHouse here, it's an amazing system that I'm sure beats out OmniSci for certain workflows.
The biggest loss for omnisci was the in-memory limits. The highest end GPUs have 32GB, while you can find CPU servers with multiple TB. As soon as you spill out of that you take a big performance hit.
The data does not fit in RAM, so I guess in the end it's about file formats and minimizing disk access. That's why some of the competition's benchmarks are terrible, no?