The whole piece seems to be based on a false premise, namely that MapReduce is supposed to replace databases. That's not the case at all; it's a way to analyze and transform data in parallel. Afterwards, you can load the output into a relational (or other) database if you want database features.
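As a rough sketch of that division of labor (the input file, record format, and table here are all made up for illustration): do the parallel transform first, then hand the results to a real database:

    # Hypothetical pipeline: parallel transform, then load into an RDBMS.
    import sqlite3
    from multiprocessing import Pool

    def transform(line):
        # The "map"-style step: parse one raw record into a row.
        user, amount = line.split(",")
        return (user, float(amount))

    if __name__ == "__main__":
        with open("raw_events.csv") as f:          # made-up input file
            lines = f.read().splitlines()

        with Pool() as pool:                       # analyze/transform in parallel
            rows = pool.map(transform, lines)

        db = sqlite3.connect("events.db")          # now you get database features
        db.execute("CREATE TABLE IF NOT EXISTS events (user TEXT, amount REAL)")
        db.executemany("INSERT INTO events VALUES (?, ?)", rows)
        db.commit()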
Guess the only valid point is 3: "Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago"
Probably true.
The Map/Reduce paradigm was being discussed in papers about functional programming languages in the early 1980s.
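For what it's worth, the paradigm is easy to state in those functional-language terms: map applies a function to every element, reduce folds the results together. A toy word count (the canonical example) using nothing but those two primitives:

    # Word count with the two functional primitives the paradigm is named for.
    from functools import reduce

    docs = ["a rose is a rose", "a map is not the territory"]

    # map step: each document becomes a list of (word, 1) pairs
    mapped = map(lambda d: [(w, 1) for w in d.split()], docs)

    # reduce step: fold all the pairs into one {word: count} dict
    def combine(counts, pairs):
        for word, n in pairs:
            counts[word] = counts.get(word, 0) + n
        return counts

    print(reduce(combine, mapped, {}))   # {'a': 3, 'rose': 2, 'is': 2, ...}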
There really are cases where a "full scan" is the fastest way to do something, and, when it works, sequential I/O can be orders of magnitude faster than the random-access I/O you get with indexes -- particularly if you have to create the index just to do your job. I've written systems that process hundreds of millions of facts, and I can do a "full scan" of these in 20 minutes on an ordinary desktop computer, whereas it takes about 4 days to load them into an index in MySQL or an RDF database.
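To make that concrete, here's a minimal sketch of such a one-pass scan, assuming a hypothetical facts.nt file of whitespace-separated subject/predicate/object triples:

    # One sequential pass over a large file of facts; no index is ever built.
    from collections import Counter

    predicate_counts = Counter()
    with open("facts.nt") as f:              # hypothetical triples file
        for line in f:                       # read strictly sequentially
            parts = line.split()
            if len(parts) >= 3:
                predicate_counts[parts[1]] += 1   # tally the predicate column

    print(predicate_counts.most_common(10))

The disk only ever sees one long sequential read, which is exactly where the order-of-magnitude win over index-driven random access comes from.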
Now we know that it's possible to parallelize SQL databases quite a bit, and commercial products exist, which leaves two questions for extra credit: (i) why do the "cool kids" completely ignore these commercial products, and (ii) why are there no open-source projects in this direction?
> Guess the only valid point is 3: "Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago" Probably true.
Unfortunately, the USPTO does not agree: MapReduce was patented in 2010 (US patent 7,650,331).
Yeah, that made me chuckle. Isn't the actual point, which the authors at best skate over, that MapReduce scales (relatively) trivially to petabytes of data?
The article is from 2008. Since then, the so-called parallel DBs have gone nowhere, and Hadoop has taken over the world.
The main problem for traditional (OLAP) DBMSs in the era of big data is that ETL (Extract/Transform/Load) becomes the main bottleneck, rather than complex queries, because big data is inherently semi-structured and noisy. MR is the tool for processing big data.
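One way to see why: a map function can absorb that noise record by record, validating each input and skipping the malformed ones instead of aborting a bulk load. A sketch, with a made-up JSON log format:

    # Tolerant "map" step over noisy, semi-structured input (made-up format).
    import json

    def map_record(line):
        """Parse one raw log line; return (key, value) or None if malformed."""
        try:
            rec = json.loads(line)
            return (rec["user_id"], rec["bytes"])   # keep only the fields we need
        except (ValueError, KeyError):
            return None                             # skip garbage instead of aborting

    raw = ['{"user_id": "u1", "bytes": 512}',
           'corrupted line!!',
           '{"user_id": "u2", "bytes": 128}']

    cleaned = [r for r in (map_record(l) for l in raw) if r is not None]
    print(cleaned)   # [('u1', 512), ('u2', 128)]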
Not well researched. For instance: Map-Reduce is not used to index data or to replace indexing. That is just one point; there are so many wrong assumptions in this article that I don't know where to begin.
"We'd never bring Hadoop code into one of our products," said Microsoft's David J. DeWitt. DeWitt is an academic expert in parallel SQL databases.
DeWitt says that in MapReduce "schema is buried" and furthermore, "the programmer must discover the structure by an examination of the code. Not only is this a very tedious exercise, but also the programmer must find the source code for the application."
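As an illustration of what "schema is buried" means in practice (this mapper and its click-log format are invented, not from the article), nothing outside the code says what the fields are, so the next programmer has to reverse-engineer the record layout from lines like these:

    # Hypothetical click-log mapper: the "schema" exists only in this code.
    def mapper(line):
        fields = line.split("\t")   # the delimiter is knowable only from this line
        user_id = fields[0]         # position 0: user id (nothing declares this)
        url = fields[2]             # position 2: url -- and what is field 1?
        ts = int(fields[3])         # position 3: apparently a unix timestamp
        return (user_id, (url, ts))

A DBMS would have declared all of that up front; here you read the parsing code to recover it.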
But he does make one important point, about how data access should be expressed:
a. By stating what you want (relational DBMS)
b. By presenting an algorithm for data access (Codasyl, MapReduce)
Well, mathematically speaking, 'a' wins hands down, since you've normalized the data and have "no garbage in the data set" (DeWitt's terminology). However, once you have fast enough access and effectively infinite memory to handle thousands of columns and millions of rows, there's no reason not to at least try a Codasyl. In that respect, Codasyl is like, say, a bubblesort. If you're going to sort at most 10 elements a million times in your application, you're better off with a quick and dirty bubblesort, which actually performs faster in this particular case than a correctly written mergesort; the mergesort will do a lot better on a million elements but will perform poorly on just 10 (for n = 10: n^2 = 100, while 5*n*ln(n) ≈ 115). When I first learnt DBMSs in school, the professor actually made this very point: "someday we'll attempt a Codasyl, just not right now." Well, that day has come.
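The constant-factor claim is easy to check empirically; a rough timing sketch (results will vary by machine, and the implementations are the obvious textbook ones):

    # Rough timing check: quadratic sort vs. mergesort on 10-element inputs.
    import random, timeit

    def bubble_sort(a):
        a = a[:]                                  # don't mutate the input
        for i in range(len(a)):
            for j in range(len(a) - 1 - i):
                if a[j] > a[j + 1]:
                    a[j], a[j + 1] = a[j + 1], a[j]
        return a

    def merge_sort(a):
        if len(a) <= 1:
            return a
        mid = len(a) // 2
        left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):   # standard merge step
            if left[i] <= right[j]:
                out.append(left[i]); i += 1
            else:
                out.append(right[j]); j += 1
        return out + left[i:] + right[j:]

    data = [random.sample(range(100), 10) for _ in range(1000)]
    print("bubble:", timeit.timeit(lambda: [bubble_sort(d) for d in data], number=10))
    print("merge: ", timeit.timeit(lambda: [merge_sort(d) for d in data], number=10))

On small inputs the recursion and list slicing in the mergesort tend to cost more than the quadratic loop saves, which is exactly the bubblesort-for-10-elements point.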
Completely missing the point. MapReduce didn't come to replace databases; it takes on tasks that databases are incapable of doing. Google's search operation would be impossible to serve sanely with an RDBMS.
Also, at least Hadoop offers a natural way of dealing with skew, namely partitioning: http://developer.yahoo.com/hadoop/tutorial/module5.html#part....
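The idea, reduced to a sketch (the skew-handling salting trick below is illustrative, not Hadoop's actual API; the first function corresponds to Hadoop's HashPartitioner):

    # A partitioner is just a function from key to reducer index.
    import hashlib, random

    NUM_REDUCERS = 8
    HOT_KEYS = {"http://example.com/"}        # keys known to dominate the data

    def default_partition(key):
        # Stable hash -> reducer index (the same idea as Hadoop's HashPartitioner).
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return h % NUM_REDUCERS

    def skew_aware_partition(key):
        # Fan a hot key out across all reducers; everything else stays stable.
        if key in HOT_KEYS:
            return random.randrange(NUM_REDUCERS)
        return default_partition(key)

Fanning a hot key out balances the load at the cost of splitting that key's reduce, so its partial results need a second pass to merge.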