AWS Redshift: How Amazon Changed The Game (aggregateknowledge.com)
187 points by rgrzywinski on May 16, 2013 | 21 comments



This blog post is great - extremely detailed and informative. We're running Infobright (paid version), so I love hearing from people using other technologies at large scale (> 1B rows) and how they compare to what they were using before. Redshift has definitely attracted our attention.

I listened to the recent Redshift performance webinar and was surprised by:

- the hard maximum of 15 concurrent queries (14 when you reserve one for management)

- the need to pre-assign queries into queues so that fast queries don't get blocked by slower ones

- the recommendation to vacuum after large inserts (showing its Postgres roots) and to limit query concurrency to 3 while vacuuming (rough sketch of that workflow below).
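
For anyone curious what the queue/vacuum advice looks like in practice, here's a rough sketch from client code using psycopg2 (Redshift speaks the Postgres wire protocol). The endpoint, table, credentials, and queue label below are all made up:

    import psycopg2

    # Hypothetical cluster endpoint and credentials.
    conn = psycopg2.connect(
        host="examplecluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="...")
    conn.autocommit = True  # VACUUM can't run inside a transaction block
    cur = conn.cursor()

    # Route this session to a specific WLM queue so a slow batch load
    # doesn't sit in front of the interactive queries.
    cur.execute("SET query_group TO 'batch';")

    # Bulk load from S3, then follow the guidance: VACUUM to re-sort and
    # reclaim space, ANALYZE to refresh the planner's statistics.
    cur.execute("COPY events FROM 's3://my-bucket/events/' "
                "CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...' GZIP;")
    cur.execute("VACUUM events;")
    cur.execute("ANALYZE events;")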

Have these limitations caused problems for anyone?

Edit: formatting, grammar.


We're currently running Impala in production with a table of over 4.5B rows that powers an internal log-analysis website. We don't have any hard limitations on concurrent queries, and no vacuuming since the data lives within Hadoop. We've been pretty happy with it so far.


What kind of performance are you getting, and how many nodes do you have?


We have a 14-node cluster; the nodes have anywhere from 4 to 6 disks each. Performance has been pretty amazing: we can do ad-hoc queries on this 4.5B-row table. Each node has read throughput of roughly 1.3GB/s for full table scans (data is Snappy-compressed and stored as RCFile, a columnar format).


That sounds pretty fantastic. When you say "ad-hoc", do you mean it's fast enough to be queried directly from a UI? Are we talking seconds or minutes for your queries?

What drawbacks have you found with Impala? I've been keeping an eye on it, and also Shark: http://shark.cs.berkeley.edu/


It depends: a query that scans our entire data set (roughly 2.8TB) could take 30-40 seconds, but we have our data partitioned on a key that matches the kind of data you'd need to populate a web page, and those queries are fast enough (< 2 seconds) for aggregations that come in via AJAX.
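
To make that concrete, a page-backing aggregation looks roughly like this. It's just a sketch using a Python DB-API client for Impala (impyla here); the host, table, and partition columns are invented:

    from impala.dbapi import connect

    # Hypothetical Impala daemon; 21050 is the default HiveServer2 port.
    cur = connect(host='impala-node.example.com', port=21050).cursor()

    # The table is partitioned on (service, log_date), so Impala reads only
    # the partitions the WHERE clause touches rather than all ~2.8TB.
    cur.execute("""
        SELECT status_code, COUNT(*) AS hits
        FROM request_logs
        WHERE service = 'frontend' AND log_date = '2013-05-15'
        GROUP BY status_code
    """)
    rows = cur.fetchall()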

We haven't yet had a chance to optimize our environment either. For example, our nodes are still running a pretty old version of CentOS, so we have LLVM disabled (which would help a lot for huge batch computations...see http://blog.cloudera.com/blog/2013/02/inside-cloudera-impala...).

Also, our data is stored in RCFile, which is not exactly the most optimized columnar storage format. We're working on a plan to get everything over to the new Parquet (http://parquet.io/) columnar format for another boost in performance.

We haven't come across any real drawbacks using Impala as of yet, it fits our needs pretty well.

Disclaimer: I work for Cloudera on their internal Tools team; we like to dogfood our own stuff :).

Edit: One drawback of Impala is the lack of UDF support, but this is something that will be coming in a later release.


If you like Redshift, you really should look at BitYota (http://www.bityota.com/). Launched at the AWS re:Invent show on the same day as Redshift, it provides much of Redshift's performance with much more customization, as well as a more rounded set of SQL extensions (including SQL-2003 OLAP functions and UDFs in multiple languages).

I don't work for them, but they make a good product.


One of the big advantages of Redshift is the Postgres protocol compatibility, which means I don't have to rewrite code to connect to Redshift and run queries. It seems that BitYota wants me to run queries via JSON. Really? Why would I lock my code into something like that?
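
Concretely, my existing Postgres client code keeps working with nothing but new connection settings; a rough sketch with psycopg2 (the endpoint and query are made up):

    import psycopg2

    # Same driver and query code I'd use against vanilla Postgres; only the
    # (made-up) endpoint and the port, 5439, change.
    conn = psycopg2.connect(
        host="mycluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="...")
    cur = conn.cursor()
    cur.execute("SELECT date_trunc('day', ts) AS day, COUNT(*) "
                "FROM events GROUP BY 1 ORDER BY 1;")
    daily_counts = cur.fetchall()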


I think you are misreading the Bityota pages. On the contrary, you CAN run queries with JSON... but you can also just pass in regular SQL statements. I guess Postgres direct protocol is handy; I find that ODBC covers many of my needs, so I haven't actually looked at Bityota as a Postgres drop-in.

Might be worth playing with what Bityota offers; Postgres is awesome but sometimes makes you bend over backwards to do things that are easier in other systems. Paraccel/Redshift has some good extensions, but sometimes there are better ways to do things.


You down with MPP? You know what it means?

(Seriously, what does it mean?)


I had to look that up. Massively Parallel Processing.


What the author is trying to say (IMHO) is that running a database on an MPP cluster (quite common these days, and easily stood up in a data center or on Amazon) is still a pain.

Outside of Postgres, none of the truly open-source databases scale well with MPPs. You're still stuck with mostly single-core processing.

So you can pay good $$$$$ to the commercial database vendors for a good database that may scale to your needs.

Or use Redshift for a great analytics DB at a reasonable price. (Note: we heavily use HBase where I am at, but that's comparing apples to oranges.)


You can add Impala on top of HBase and get a pretty good SQL-based, low-latency analytics solution (if your data is structured to let Impala take advantage of row-key ordering).

* Disclosure: I'm an Apache HBase committer, I've written parts of Impala's HBase integration, and I work at Cloudera.


Apples to Oranges, how so, if you don't mind me asking? What about HBase makes it different from RedShift/Greenplum/Oracle?


HBase isn't built for analysis workloads. It doesn't have a complex query engine, so you end up having to do massive scans (which aren't especially fast), transfer a ton of data to the node where your analytics code is running, and do the computation there. If things are too big to run on one machine, that's your problem, not HBase's.

On the other hand, Impala, RedShift, Oracle Exadata, etc. let you ask the database to do work at a much higher level, which allows for much better performance because the data storage and computation layers can work in tandem (so you can prune down to only the data your query needs at each storage node before hitting the network, for example), and the database does the work of optimizing for multiple cores and nodes, not the writer of the analysis routine.
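
To illustrate the difference with a rough sketch (not production code; the clients here are happybase and impyla, and all table/column names are invented): with plain HBase the scan ships rows to a single client that does the math, while a query engine pushes the aggregation to the nodes holding the data.

    from collections import Counter

    import happybase                   # plain HBase client
    from impala.dbapi import connect   # SQL engine over the same data

    # HBase route: every matching cell travels to this one client, and the
    # aggregation runs here, on a single machine.
    counts = Counter()
    hbase = happybase.Connection('hbase-gateway.example.com')
    for row_key, data in hbase.table('events').scan(columns=[b'd:status']):
        counts[data[b'd:status']] += 1

    # Query-engine route: the GROUP BY executes in parallel next to the
    # data; only the small aggregated result crosses the network.
    cur = connect(host='impala-node.example.com', port=21050).cursor()
    cur.execute("SELECT status, COUNT(*) FROM events GROUP BY status")
    results = cur.fetchall()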


I think there are still definitely workloads better suited to other data-store setups like graph, tabular, etc. (at least based on performance testing I've seen in the past). On the plus side, a lot of standard crappy workloads (weblog reduction) will work wonderfully in columnar setups like Redshift.


The comments on pricing of alternative products were ... odd. E.g., Netezza was available for purchase, hardware of course included, at $20K/TB several years ago, counting 2.25X compression.

Well, except for the last such comment, which contradicted some of the prior ones by 1-2 orders of magnitude.


> Sure, sure there are other open-source options out there that aren’t Hadoop, but for example Impala 1.0 just came out...

Although Impala isn't technically Hadoop, its daemons are co-located with each of your Hadoop DataNodes and talk directly to HDFS and HBase. Also, there is no ETL process with Impala: you could drop flat files with a few billion rows into Hadoop and query them within a few minutes.
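
For example, once the files land in HDFS you can point an external table at the directory and query it right away; a rough sketch via impyla, with the path and schema invented:

    from impala.dbapi import connect

    cur = connect(host='impala-node.example.com', port=21050).cursor()

    # Point an external table at the directory the flat files were dropped
    # into; there's no load step, Impala reads the files where they sit.
    cur.execute("""
        CREATE EXTERNAL TABLE raw_events (
            ts BIGINT, user_id STRING, url STRING, status INT
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
        LOCATION '/data/raw_events'
    """)
    cur.execute("SELECT COUNT(*) FROM raw_events")
    total_rows = cur.fetchone()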


"..hop onto Hadoop and ride the open-source rodeo with Hive or Pig by your side." - Laughed at this sentence. Well written!


I wish Salesforce would follow. At the moment, 2GB of database storage there costs nearly the same as 2TB of storage on Redshift. And yes, that doesn't include the licence itself.


Well-written article! Realistic from an ops perspective, and it also includes marginal AWS cost calculations, which are very interesting but difficult to estimate.



