Hacker News new | past | comments | ask | show | jobs | submit login

We're currently running Impala in production with a table that currently has over 4.5B rows which powers an internal log analysis website. We don't have any hard limitations for concurrent queries, and no vacuuming since the data lives within Hadoop. We've been pretty happy with it so far.



What kind of performance are you getting, and how many nodes do you have?


We have a 14 node cluster, the nodes have anywhere between 4-6 disks. Performance has been pretty amazing, we can do ad-hoc queries on this 4.5B row table. Each node has read throughput at about ~1.3GB/s for full table scans (data is snappy compressed, store as RCFile: columnar).


That sounds pretty fantastic - when you say "ad-hoc" do you mean that it's fast enough to be directly queried from a UI - are we talking seconds or minutes for your queries?

What drawbacks have you found with Impala? I've been keeping an eye on it, and also Shark: http://shark.cs.berkeley.edu/


It depends, if you plan to scan our entire data set it could take 30-40 seconds (roughly ~2.8TB), but we have our data partitioned based on a key that makes sense for the kind of data you'd need to populate a web page and these queries are fast enough (< 2 seconds) for aggregations that come in via AJAX.

We haven't yet had a chance to optimize our environment either. For example, our nodes are still running a pretty old version of CentOS, so we have LLVM disabled (which would help a lot for huge batch computations...see http://blog.cloudera.com/blog/2013/02/inside-cloudera-impala...).

Also, our data is stored in RCFile, which is not exactly the most optimized columnar storage format. We're working on a plan to get everything over the new Parquet (http://parquet.io/) columnar format for another boost in performance.

We haven't come across any real drawbacks using Impala as of yet, it fits our needs pretty well.

Disclaimer: I work for Cloudera in their internal Tools Team, we like to dog food our stuff :).

Edit: One drawback of Impala is the lack of UDF support, but this is something that will be coming in a later release.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: