Whenever I look at large aggregation benchmarks like this, I try to estimate cycles/value or, better, cycles/byte.
Take this query:
SELECT cab_type,
       count(*)
FROM trips
GROUP BY cab_type;
This is just counting occurrences of distinct values from a bag of total values sized @ 1.1B.
He's got 8 cores @ 2.7GHz, which presumably can clock up for short bursts at least a bit even when they're all running all out. Let's say 3B cycles/core/second. So in .134 seconds (the best measured time) he's burning ~3.2B cycles to aggregate 1.1B values, or about 3 cycles/value.
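As a quick sanity check on that arithmetic (a throwaway snippet, nothing from the article):

    // Back-of-envelope: 8 cores * 3 GHz * 0.134 s spread over 1.1B values.
    #include <cstdio>
    int main() {
        double cycles = 8 * 3e9 * 0.134;                      // ~3.2B cycles burned
        std::printf("%.2f cycles/value\n", cycles / 1.1e9);   // ~2.92
    }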
While that's ridiculously efficient for a traditional row-oriented database, for a columnar scheme as I'm sure OmniSciDB is using, it's less efficient than I might have expected.
Presumably the # of distinct cab types is relatively small, and you could dictionary-encode all possible values in a byte at worst. I'd expect opportunities both for computationally friendly compact encoding ("yellow" is presumably a dominant outlier and could make RLE quite profitable) and SIMD data parallel approaches that should let you roll through 4,8,16 values in a cycle or two.
Even adding LZ4 should only cost you about a cycle a byte.
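To make the dictionary-encoded counting idea concrete, here's a minimal sketch (an illustration only, assuming cab_type is already encoded down to one byte per row; not OmniSciDB's actual code). Note that a plain indexed-increment loop like this doesn't auto-vectorize well because of index conflicts, which is why hand-written SIMD versions typically keep per-lane sub-counts:

    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Count occurrences of each dictionary id, assuming at most 256 distinct values
    // (one byte per row after dictionary encoding).
    std::array<uint64_t, 256> count_ids(const std::vector<uint8_t>& ids) {
        std::array<uint64_t, 256> counts{};
        for (uint8_t id : ids) ++counts[id];
        return counts;
    }

    int main() {
        std::vector<uint8_t> ids = {1, 1, 0, 1, 1, 1, 0, 1};  // e.g. 0 = green, 1 = yellow
        auto counts = count_ids(ids);
        std::printf("green=%llu yellow=%llu\n",
                    (unsigned long long)counts[0], (unsigned long long)counts[1]);
    }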
That's not to denigrate OmniSciDB: They're already several orders of magnitude better than traditional database solutions, and plumbing all the way down from high-level SQL to bit twiddling SIMD is no small feat. More that there's substantial headroom to make systems like this even faster, at least until you hit the memory bandwidth wall.
I think this is a good point. On GPUs, SIMT is effectively automatic vectorization, so our focus has been on the memory bandwidth wall (we make use of cuda shared memory in nvidia GPU mode for aggregates like the above query). Non-random access compression on GPUs also has been a nonstarter, at least historically. With more recent GPUs and more recent versions of CUDA, perhaps this is changing. But on CPUs, we have started looking into vectorization. There is a tradeoff, though -- the vectorization LLVM passes do add time to the compilation phase, and at subsecond query speeds that time isn't always worth it.
There are also a few other tricks to get closer to roofline performance. If you sort the input data on the key you're grouping by you can see small performance improvements, mostly from better cache locality. But, part of the "magic" of OmniSciDB is that you can group on any key and get good performance without ingesting, reindexing, etc.
I wonder what this query compiles to, in terms of a C-code equivalent?
SELECT cab_type, count(*)
FROM trips
GROUP BY cab_type;
From the execution time it seems to me that this is a straight sum() of 32-bit integers. "cab_type" has two distinct values, and if the stored 32-bit value for "green" is 0 and "yellow" is 1, a straight sum of those integers produces the desired counts (the sum is the yellow count, and green is the remainder) and explains the performance. That said, the same performance will not extend to a key that has three or more distinct values.
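If that's what's happening, a toy version of the trick (a guess at the shape, not taken from the article) would be:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // With only two groups encoded as 0 (green) and 1 (yellow), a plain sum
    // gives the yellow count; green is the remainder. This turns the GROUP BY
    // into a simple reduction that vectorizes trivially.
    int main() {
        std::vector<int32_t> cab_type = {1, 1, 0, 1, 1, 0, 1, 1};
        uint64_t yellow = 0;
        for (int32_t v : cab_type) yellow += v;
        uint64_t green = cab_type.size() - yellow;
        std::printf("yellow=%llu green=%llu\n",
                    (unsigned long long)yellow, (unsigned long long)green);
    }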
In a naive case it compiles to a loop over all the elements and a hash table probe for each element. Now the magic comes from a few observations (see the sketch after this list):
- cab_type has very few distinct values. so you can encode those values from 1 .. N and use an array of size N instead of the hash table
- you can build a “parallel scan”: split the rows evenly across many threads and each thread processes its own chunk
- the operation per row is very basic: you need to look up in the array and increment a value. so you can use SIMD to perform operations on multiple rows at the same time
- using some bit manipulation magic you can do the above on “encoded values”: you never need to convert the cab_type bit representation to an integer from 1..N
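Put together, a rough non-SIMD sketch of that plan (hypothetical code, just to make the shape concrete): each thread counts its own chunk into a small local array, and the per-thread partials are merged at the end.

    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    constexpr int kGroups = 4;  // assume few distinct values, densely encoded as 0..kGroups-1

    int main() {
        std::vector<uint8_t> keys(1'000'000, 1);   // stand-in for encoded cab_type
        unsigned nthreads = std::thread::hardware_concurrency();
        if (nthreads == 0) nthreads = 4;
        std::vector<std::array<uint64_t, kGroups>> partial(nthreads);  // one local count array per thread
        std::vector<std::thread> pool;
        size_t chunk = (keys.size() + nthreads - 1) / nthreads;
        for (unsigned t = 0; t < nthreads; ++t) {
            pool.emplace_back([&, t] {
                size_t begin = t * chunk, end = std::min(keys.size(), begin + chunk);
                for (size_t i = begin; i < end; ++i) ++partial[t][keys[i]];
            });
        }
        for (auto& th : pool) th.join();
        std::array<uint64_t, kGroups> total{};     // merge the partial counts
        for (const auto& p : partial)
            for (int g = 0; g < kGroups; ++g) total[g] += p[g];
        for (int g = 0; g < kGroups; ++g)
            std::printf("group %d: %llu\n", g, (unsigned long long)total[g]);
    }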
Thanks for the explanation, Nikita. Any branching, hashing, etc. will increase execution time. The benchmark was run on a laptop, which has 2 memory channels; 0.13s is about the absolute best this laptop can muster as far as memory throughput goes. Usually 4 threads are enough to saturate memory.
This is a straight sum of 64-bit values; using AVX2 it will sum 1B of them in 0.26s. Incrementing conditionally will not be as fast and will throw vectorization out the window too.
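For reference, a minimal AVX2 sum of 64-bit integers looks roughly like this (a generic sketch, not MemSQL's code; compile with -mavx2):

    #include <immintrin.h>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Sum 64-bit integers four at a time with AVX2; the remainder is handled scalar.
    int64_t avx2_sum(const int64_t* data, size_t n) {
        __m256i acc = _mm256_setzero_si256();
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m256i v = _mm256_loadu_si256((const __m256i*)(data + i));
            acc = _mm256_add_epi64(acc, v);
        }
        int64_t lanes[4];
        _mm256_storeu_si256((__m256i*)lanes, acc);
        int64_t total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
        for (; i < n; ++i) total += data[i];
        return total;
    }

    int main() {
        std::vector<int64_t> v(1000, 1);
        std::printf("%lld\n", (long long)avx2_sum(v.data(), v.size()));
    }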
And of course there is no branching in MemSQL for this use case. And also no hashing b/c number of groups is small and you can use an array and not a hashtable.
Finally, if you compress the data rather than doing the sum on an uncompressed array, you get a much more compact data representation, which lets you avoid hitting the memory bandwidth ceiling this quickly (with 4 threads).
When you say an array and not a hash table, do you just mean a simple perfect hash table indexed by the offset of the dictionary id? We use this fairly extensively for inputs of bounded domain (i.e. dictionary-encoded strings, moderately-sized integer ranges, even binned values, numeric or timestamp), but call it perfect hashing. I assume we're talking about the same thing but wanted to clarify.
I’m still of an opinion that it’s important to demonstrate performance on more complex queries with joins, subqueries, subselects, and clustered data movements. The count(*), group by query is a very very simple case.
As someone who has built a columnstore database and tested it on that query shape (group-by over a column with only a few distinct values), I can say it's possible to go faster than 1 cycle per row via SIMD and operating directly on compressed data (0.87 cycles per row). This blog post gives some details on how it's done (https://www.memsql.com/blog/how-to-process-trillion-rows-per...)
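The blog post covers the real techniques; as a much simpler illustration of why operating on compressed data helps here, a count over run-length-encoded values touches one (value, length) pair per run instead of one element per row (a hypothetical sketch, not MemSQL's code):

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    struct Run { uint8_t value; uint64_t length; };  // RLE: (dictionary id, run length)

    int main() {
        // 900M "yellow" rows followed by 200M "green" rows are just two runs.
        std::vector<Run> runs = {{1, 900'000'000ULL}, {0, 200'000'000ULL}};
        std::unordered_map<uint8_t, uint64_t> counts;
        for (const Run& r : runs) counts[r.value] += r.length;  // one add per run, not per row
        for (const auto& [value, count] : counts)
            std::printf("value %u: %llu rows\n", (unsigned)value, (unsigned long long)count);
    }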
But for the inferences you've made (e.g. the # of distinct cab types is relatively small) you need knowledge of the whole data. What if someone uses the wrong column name? Getting the correct summary of data quickly is easy -- if you already know the answer.
Something like `count(*)` needs to work well where you have no idea at all about the data.
What would it take to get it to be 1 cycle/value? Does 1 cycle here mean 1 instruction? If not, how many instructions does it allow? It's just a matter of copying (in user-space I'm guessing) memory around, right?
Modern superscalar CPUs with multi-layer caches, hyperthreading, vector extensions, ... have an incredibly complex, dynamic relationship between cycles and instructions executed.
For any particular implementation of a tight inner loop like this, you could measure IPC via internal counters and it would be quite consistent, but it’s really hard to estimate without them.
And does it matter? Ultimately you care about how many cores at how many GHz you need to get the job done.
For those wanting to try it for themselves, we recently released a preview of our full stack for Mac (containing both OmniSciDB as well as our Immerse frontend for interactive visual analytics), available for free here: https://www.omnisci.com/mac-preview. This is a bit of an experiment for us, so we'd love your feedback! Note that the Mac preview doesn't yet have the scalable rendering capabilities our platform is known for, but stay tuned.
You can also install the open source version of OmniSciDB, either via tar/deb/rpm/Docker for Linux (https://www.omnisci.com/platform/downloads/open-source) or by following the build instructions for Mac in our git repo: https://github.com/omnisci/omniscidb (hopefully will have standalone builds for Mac up soon). You can also run a Dockerized version on your Mac, but as a disclaimer the performance, particularly around storage access, lags a bare metal install.
I have zero experience authoring brew packages, but as a consumer, a ‘brew install omniscidb’ would be really nice. Especially with things like databases that you might build apps on top of, it's convenient to have a tool manage the installed versions, and it's also very nice to have a common interface for managing these things. I use brew for mysql, postgres, redis, qt, rbenv, node and more. Fitting into that ecosystem makes it easier for me to bring this in as a dependency, since it's just another brew dev dependency.
Unfortunately it's probably not in the cards in the near term, just due to other priorities and insufficient demand (plus alternatives like HIP for AMD). I will say a lot of us here at OmniSci would kill to leverage the latent GPUs in our Macs and other places, so we'd welcome any community help towards this end (it's not a trivial thing to add, but also not particularly difficult either, just work).
Agreed. I'm also a bit frustrated that he never tuned Vertica in his test of it - he just loaded the data into the default superprojection and queried it like that. Nearly all of Vertica's power comes from its concept of projections - just like normal DBs benefit from indexes. I'd really like to see how it performs against these other systems once it's been tuned properly, because in my experience it's always been the fastest DB I've ever worked with.
You may find the recent RAPIDS open source community GPU stack benchmark on TPCx-BB relevant, where they beat out the rest of the tools on something like 2x cheaper hardware while running ~44x faster. It's an interesting industry benchmark b/c it requires handling diverse use cases and out-of-core optimizations (e.g., 1TB going through a 16GB GPU, and a combination of matrices, tables, etc), so BlazingSQL or cuML alone wouldn't really cover it, but all of them together did.
(To make this link work, press the "100 mln." button when you arrive. It seems there is an encoding issue with this link that causes it not to work correctly when copied to some sites.)
I’m almost entirely sure Litwintschik is misinformed in regards to the GPUs in his laptop.
Yes, he does have the Intel GPU he mentioned, but if he paid $200 to upgrade the GPU as he claims, he would also have a dedicated AMD Radeon Pro 5500M 8GB.
Actually, I don't think it's possible to configure a 16" MBP without a discrete GPU - they all come with AMDs. Only the 13" MBPs can be configured without one, but he says he's using a 16".
His other benchmarks of OmniSciDB ran on systems with Nvidia GPUs, but I think he's pointing out that in this case the AMD GPU wasn't used by the DB engine even if it was part of the system config.
Just to clarify, most of the query engine is built around LLVM-based JIT compilation, and CUDA is not really used per se except for GPU-specific operators like atomic aggregates and thread synchronization, and of course we use the driver API to manage the GPUs, allocate memory, etc. Supporting AMD GPUs or the upcoming Intel Xe GPUs (or frankly anything that has an LLVM backend) would not be particularly hard, it would just require adding similar supporting infra.
He may have been going by "About this Mac", which will show the Intel GPU if you're not plugged into an external monitor, or using a program that activates the dedicated GPU.
I once contacted the author to check out my open source package and benchmark it and he mentioned that he actually charges for the benchmarking exercise. So yeah.
The implication is that this article was paid for by OmniSciDB or someone with an interest. This changes the presumed context and should be noted upfront by the author if that is the case.
Perhaps, though I suspect he didn't start his benchmark series that way. Once it got popular, he may have gotten more requests (for essentially free marketing) and needed a paygate.
His articles are extremely well documented, so there's nothing stopping anyone from performing the same benchmarks.
I don't know if I'd necessarily draw that implication -- I have no data; my broader point was more around the fact that consultants have a different way of saying "no" than most people.
People can elect to work on things voluntarily for personal reasons, but if someone asks them to fix or do something for them, then instead of saying no, they might say "how much will you pay me?" It's their time and they're under no obligation to offer it to you for free (though they can voluntarily choose to if they want).
That's a valid point. It's possible everything else was on his own time and it was just his way of saying no. I end up reading the "how much will you pay me?" as suggesting other things posted may be paid. That's not a fair read.
>The GPU won't be used by OmniSciDB in this benchmark but for the record it's an Intel UHD Graphics 630 with 1,536 MB of GPU RAM. This GPU was a $200 upgrade over the stock GPU Apple ships with this notebook. Nonetheless, it won't have a material impact on this benchmark.
He lost me here...I get that it doesn't matter, but come on, if you don't know that your computer has a GPU other than the integrated graphics (that you admit you paid more to upgrade) then what are you really doing...
Another comment suggested that he got this information from the "About This Mac" popup, which only shows the dedicated GPU when connected to an external display or when using an application that activates the dedicated GPU.
If he ran his benchmarks and then checked his hardware while writing this article then the author might've gotten confused by that.
Mark did a benchmark of SQLite using its internal file format a few years ago (https://tech.marksblogg.com/billion-nyc-taxi-rides-sqlite-pa...), clocking the import at 5.5 hours. It looks like this was done though on a spinning disk, so given a proper SSD, and a newer version of SQLite, it might be much faster.
Caveat: these benchmarks only test the simplest of operations like aggregation (GROUP BY, COUNT, AVG) and sorts (ORDER BY). No JOINs or window operations are performed. Even basic filtering (WHERE) doesn't seem to have been tested. YMMV.
No, but I think each piece of software is put in a proper context, to match what most would commonly use in that particular use case. For example, the Clickhouse benchmarks are run against typical modest cloud instances.
To be fair, the c5d.9xlarge instances are $1.728 each per hour, or $5.18 for the 3-server cluster (looks to be about $3.06/hr for reserved 1-year pricing). Even with reserved pricing, that's $26,806 a year, or 6.5X more than a $4K laptop that likely will last for years and would be bought anyway (or at least a cheaper variant, which would also run these queries nearly as quickly). Of course that's very apples-to-oranges, so another way to look at this is that OmniSci would probably see significantly better performance on a single c5d.9xlarge than what we saw on this Mac (would need to benchmark, but informally I can say that OmniSci was 2-3X faster running on CPU on my Linux workstation compared to my Mac).
Disclaimer: No disrespect to ClickHouse here, it's an amazing system that I'm sure beats out OmniSci for certain workflows.
The biggest loss for omnisci was the in-memory limits. The highest end GPUs have 32GB, while you can find CPU servers with multiple TB. As soon as you spill out of that you take a big performance hit.
The data does not fit in RAM, so I guess in the end it's about file formats and minimizing disk access. That's why some of the competition's benchmarks are terrible, no?