MapReduce: A major step backwards? (vertica.com)
26 points by xtacy on May 26, 2011 | 17 comments


The whole piece seems to be based on a false premise, namely that MapReduce is supposed to replace databases. That's not the case at all; it's a way to analyze and transform data in parallel. Afterwards, you can load the results into a relational (or other) database if you want database features.

Also, Hadoop at least offers a natural way of dealing with skew, namely partitioning: http://developer.yahoo.com/hadoop/tutorial/module5.html#part....
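
A minimal sketch of what such a custom partitioner could look like in the newer Hadoop API (the hot key and class name here are invented for illustration, not taken from the linked tutorial):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical partitioner that spreads one known "hot" key across
    // several reducers instead of funnelling all of its records to one.
    public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {

        private static final String HOT_KEY = "the";  // illustrative hot key
        private int spread = 0;                        // round-robin counter

        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (key.toString().equals(HOT_KEY)) {
                // Fan the hot key out round-robin; its partial results then
                // need a second pass (or a client-side merge) to combine.
                return (spread++ & Integer.MAX_VALUE) % numPartitions;
            }
            // Hash partitioning, like the default partitioner, for everything else.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

You'd wire it in with job.setPartitionerClass(SkewAwarePartitioner.class); the point is just that key-to-reducer placement is under your control, so a skewed key distribution doesn't have to mean one overloaded reducer.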


yeah, it doesn't make a lot of sense..

Guess the only valid point is 3: "Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago" Probably true..


The map/reduce paradigm was already being discussed in papers on functional programming languages in the early 1980s.

There really are cases where a "full scan" is the fastest way to do something, and, when it works, sequential I/O can be orders of magnitude faster than the random-access I/O you get once indexes are involved, particularly if you have to create the index in order to do your job. I've written systems that process hundreds of millions of facts, and I can do a "full scan" of these in 20 minutes on an ordinary desktop computer, whereas it takes about 4 days to load them into an index in MySQL or an RDF database.
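
A minimal sketch of the kind of single sequential pass being described, assuming newline-delimited, tab-separated fact records in a flat file (the file name and column layout are made up):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    // One sequential pass over a flat file of facts, counting records per
    // subject. No index is built or consulted, so the file is read strictly
    // sequentially from start to finish.
    public class FullScanCount {
        public static void main(String[] args) throws IOException {
            Map<String, Long> perSubject = new HashMap<>();
            try (BufferedReader in = Files.newBufferedReader(Paths.get("facts.tsv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String subject = line.split("\t", 2)[0];  // first column; layout is hypothetical
                    perSubject.merge(subject, 1L, Long::sum);
                }
            }
            System.out.println(perSubject.size() + " distinct subjects");
        }
    }

The whole job is one streaming read; the 4-day figure above is the cost of first turning that same data into random-access index structures before you can even start querying.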

Now we know that it's possible to parallelize SQL databases quite a bit, and commercial products exist, which leaves two questions for extra credit: (i) why do the "cool kids" completely ignore these commercial products, and (ii) why are there no open-source projects in this direction?


(i) because free is more accessible than expensive. (ii) because it is expensive.


(iii) because when it breaks you're at the mercy of the vendor to fix it.


> Guess the only valid point is 3: "Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago" Probably true..

Unfortunately, the USPTO does not agree. MapReduce was patented in 2010.

http://www.google.com/patents?id=upHLAAAAEBAJ


"[Note: Although the system attributes this post to a single author, it was written by David J. DeWitt and Michael Stonebraker]"

I guess their schema doesn't handle multiple authors.


> Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale.

Umm... Google search?


Yeah, that made me chuckle. Isn't the actual point, which the authors seem to at best skate over, that MapReduce scales (relatively) trivially to petabytes of data?


The article is from 2008. Since then, the so-called parallel DBs have gone nowhere, and Hadoop has taken over the world.

The main problem for traditional (OLAP) DBMSs in the era of big data is that ETL (Extract/Transform/Load), rather than complex queries, becomes the main bottleneck, since big data is inherently semi-structured and noisy. MR is the tool for processing big data.
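
As a sketch of that ETL-style use, here is what a Hadoop mapper that turns noisy, semi-structured log lines into clean key/value records might look like (the record format and field positions are invented for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical ETL-style mapper: extract (userId, bytes) pairs from
    // whitespace-separated log lines and silently drop malformed records,
    // leaving aggregation to the reduce phase or a downstream database load.
    public class LogCleanMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\\s+");
            if (fields.length < 3) {
                return;                        // truncated or noisy line: skip it
            }
            try {
                String userId = fields[1];     // invented column positions
                long bytes = Long.parseLong(fields[2]);
                ctx.write(new Text(userId), new LongWritable(bytes));
            } catch (NumberFormatException e) {
                // malformed numeric field: drop the record rather than fail the job
            }
        }
    }

The "transform" step is just code, so it can tolerate whatever mess the input contains; the cleaned output can then be bulk-loaded into a warehouse or queried with further MR jobs.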


Seems that the author needs to do some serious research and reading on MapReduce.


Not well researched. For instance: Map-Reduce is not used to index data or to replace indexing. This is just one point. There are so many wrong assumptions in this article that I don't know where to begin.


exactly, plus map-reduce can be combined with indexing - see CouchDB


http://www.computerworld.com/s/article/9142406/Big_three_dat...

"We'd never bring Hadoop code into one of our products," said Microsoft's David J. DeWitt. DeWitt is an academic expert in parallel SQL databases.

DeWitt says that in MapReduce "schema is buried" and furthermore, "the programmer must discover the structure by an examination of the code. Not only is this a very tedious exercise, but also the programmer must find the source code for the application."

Ever heard of Hive? http://wiki.apache.org/hadoop/Hive/GettingStarted#SQL_Operat...

But he does make one important point: "whether a DBMS should be written (a) by stating what you want (relational DBMS) or (b) by presenting an algorithm for data access (Codasyl, MapReduce)."

Well, mathematically speaking, (a) wins hands down, since you've normalized the data and have "no garbage in the data set" (DeWitt's terminology). However, once you have fast-enough access and potentially infinite memory to handle thousands of columns and millions of rows, there's no reason not to at least try a Codasyl. In that respect, Codasyl is like, say, a bubble sort. If you're going to sort at most 10 elements a million times in your application, you'd be better off with a quick-and-dirty bubble sort, which actually performs faster in that particular case than a correctly written merge sort; merge sort does far better on a million elements but is no win on just 10 (for n = 10: n^2 = 100, versus roughly 5*n*ln(n) ≈ 115). When I first learnt DBMSs in school, the professor actually made this very point: "someday we'll attempt a Codasyl, just not right now." Well, that day has come.


We're using both. Vertica is amazingly fast. Hadoop helps us analyze some very big data sets. I wouldn't want to lose either one.


Completely missing the point. MapReduce didn't come to replace databases; it takes on tasks that databases are incapable of doing. Google's search operation would be impossible to serve sanely with an RDBMS.


Looks like "enterprise company selling expensive black box trying to direct attention to their product" to me.



