Summary - implements MapReduce using Amazon's native cloud offerings (S3, SQS and SimpleDB) instead of relying, as Hadoop does, on traditional OS filesystems and services.
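For anyone skimming, here's a rough sketch of what a map worker coordinated through SQS and S3 might look like. This is not the paper's actual code; the queue name, bucket names and message format are all made-up assumptions, it just illustrates the "queue for task coordination, S3 for data" shape of the design.

```java
// Hypothetical map worker: pulls task descriptions from SQS, reads the
// input split from S3, stages intermediate output back to S3.
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

public class MapWorkerSketch {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // Hypothetical queue holding one message per map task.
        String taskQueueUrl = sqs.getQueueUrl("map-task-queue").getQueueUrl();

        // Each message body names an S3 object (input split) to process.
        for (Message msg : sqs.receiveMessage(taskQueueUrl).getMessages()) {
            String inputKey = msg.getBody();
            String split = s3.getObjectAsString("cmr-input-bucket", inputKey);

            // ... run the user map() over the split; placeholder output here ...
            String intermediate = split.toUpperCase();

            // Stage intermediate results in S3 for the reduce phase to pick up.
            s3.putObject("cmr-intermediate-bucket", "map-out/" + inputKey, intermediate);

            // Deleting the message marks the task done; SQS redelivery of
            // undeleted messages is what gives you retry on worker failure.
            sqs.deleteMessage(taskQueueUrl, msg.getReceiptHandle());
        }
    }
}
```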
Interesting idea. I'm wary of the reliability issues, though; I'm doing a lot with SimpleDB and there are plenty of landmines.
The performance evaluation vs Hadoop looks bogus to me. S3, SQS and SimpleDB run on real (non-virtualized) hardware, so you need a lot more Hadoop nodes to make the comparison fair. At the very least, throughput per AWS dollar should be reported.
The ~100k-file inverted index test is completely unfair to Hadoop: launching 100k map tasks, one per small file, is ridiculous because it's basically measuring JVM startup time. People/crawlers typically pack all these pages into large map/sequence files, and Hadoop then automatically launches one map task per chunk (128MB by default).
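To make that concrete, here's a sketch (assuming the standard Hadoop mapreduce API; paths and sizes are illustrative, not from the benchmark) of packing the small pages into a single SequenceFile so the job gets one map task per block rather than per page:

```java
// Step 1: pack the small HTML pages into one SequenceFile (filename -> contents).
// Step 2: feed that file to the inverted-index job, which splits it by block size.
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class PackAndIndex {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Pack ~100k small pages into a single SequenceFile.
        Path packed = new Path("/data/pages.seq");
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packed),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (File f : new File(args[0]).listFiles()) {
                String html = new String(Files.readAllBytes(f.toPath()), "UTF-8");
                writer.append(new Text(f.getName()), new Text(html));
            }
        }

        // The job now reads the packed file; Hadoop launches one map task per
        // chunk of it instead of one per page, so JVM startup cost is amortized.
        Job job = Job.getInstance(conf, "inverted-index");
        job.setInputFormatClass(SequenceFileInputFormat.class);
        SequenceFileInputFormat.addInputPath(job, packed);
        // ... set mapper/reducer/output classes as usual ...
    }
}
```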
Fetching data to EC2 for computation is also a step backward. It cannot scale to large data, as the cluster becomes switch-bound much earlier than with "kosher" MapReduce, where data locality is observed.
In any case, it's neither fast nor lean (it's basically Java glue code around S3/SQS/SimpleDB, and it seems to have a larger overall carbon footprint than Hadoop).
Here's their technical paper:
http://sites.google.com/site/huanliu/cloudmapreduce.pdf