I remember this paper, and was at a company at the time that was one of the first to use MapReduce, so I saw this all play out first hand. I appreciated the paper. Then and now, developers rush to grab new technologies, especially those that stroke their ego-driven fantasies of “working at scale”, without considering their underlying constraints or applicability. At the time this was published, every company and startup under the sun was rushing to use MapReduce, most often in places where it wasn’t warranted. I’m glad someone surfaced this paper again; people still need to learn the lessons that it outlines. Microservices and k8s: I’m looking straight at you.
Map reduce came to my radar around the same time the Trough of Disillusionment hit for some other things, including design patterns. We still believed in the 8 Fallacies of Distributed Computing back then, before cloud providers came along and started selling Fallacies as a Service.
I can’t wait for that hangover to hit us. It’s likely to be the best one of my career.
“Wow when we get rid of all this on premise stuff and IT and outsource it all to cloud we are going to save so much money…”
LOL
Companies are now paying tens of thousands a month for compute that could run on a 10 year old physical server. They also have no control and don’t even own their data anymore in some cases.
They’re starting to realize it, but now that they’ve divested themselves of in-house IT they no longer have the in-house expertise to do anything differently, so they are kinda locked in.
Well allow me to retort. I have done 2 unicorns in the last 10 years and I have 6 new projects going on. None of them would even have been possible without cloud; almost all are under $100 a month on Vercel.
I think the same lack of discipline that causes costs to overrun in the cloud causes costs to overrun in owned hardware. The only real benefit is that you can depreciate the owned hardware for tax benefit but it won't really make owned hardware help you control costs necessarily. The costs will just surface elsewhere instead.
It’s kinda the same problem, isn’t it? If I had a million dollar budget last year and I don’t have one this year, I’ve lost status. Unless the company is flaming out and then lowering my budget is a good thing for a little bit, until it isn’t.
I’ve had maybe three bosses who gave me the recognition I felt I deserved for saving us huge amounts of time and money by chasing something down that other people couldn’t even picture in their heads.
And every single one of those went on to work for a company I absolutely despise and asked me to follow them. As you can guess I have some feelings about that.
I get kudos for this work when the word comes down that we need to cut costs or else. That is the only time I'll attempt it. If that word hasn't been delivered, then any attempt to manage cost in that way is usually met with resistance and communication about "Other priorities".
There are a few places I've worked where cost cutting was valued but they were few and far between really.
Generally I appreciate this post, as yeah the bandwagon effect is real.
I'd characterize MapReduce as a very, very specific, narrow architectural pattern. Trying to apply it contorts the code you write. I don't see anything remotely like that being true of Kubernetes or containers (microservices much more so, in creating constraints).
We had to reset the Days Since Kubernetes Whinge counter again yesterday: https://news.ycombinator.com/item?id=39868586 . A couple of people spoke to how you might not need containers, but I still haven't heard anyone say what not having containers could win you. What types of code can you only write without containers? We can convince ourselves that Kubernetes is hard, but lots of people also say it's easy/not bad, so there's some difficulty factor that's unknown/variable. And I really struggle to see a parallel between a strong code-architecture choice like MapReduce and a generic platform like Kubernetes or containers.
The platform seems pleasantly neutral in shaping what you do with it, in my view; if that wasn't true it would never have been a success.
> I still haven't heard anyone say what not having containers could win you. What types of code can you only write without containers?
Code with one less layer of abstraction. If that layer is buying you something you need, it's great. But abstraction isn't a positive in & of itself, it's why we get upset about GenericAbstractFactoryBeanSingleton(s).
Containers on Kubernetes are a major part of what lets me put fewer abstractions into the code directly, if only because I can turn bits that would otherwise be built in, or live in messy companion scripts, into more or less standardised patterns in k8s.
What would you change about your coding because of deploying in a container? What do you tell your junior & senior engineers to do differently?
I call bull frelling shit. It's not an abstraction. It's a deployment pattern. It doesn't affect how or what we code.
Endless shitty pointless bullshit grousing, over nothing. It doesn't actually matter. It's just hip, you get to feel good, by pretending you are dunking on sheeples. Frivolous counter-cultural motions that people use to make themselves feel advanced & intelligent. But actually, being free of this pattern doesn't really buy you anything new or different. The same code & the same approaches are viable with or without. It just feels good to be shitty to the mainstream.
If there were abstractions maybe there would be some justification for the slippery slope "oh no!" panic. And K8s is definitely some kind of abstraction. But the sloppy slippery-slope "oh no, abstractions" shit still pulls no weight for me; it fails to acknowledge that sometimes abstraction can be & is useful, and allows good things.
> What do you tell your junior & senior engineers to do differently?
I tell junior & senior engineers to be careful about what assumptions they make of their environments, and to consider everything they depend on a potential liability.
> And what exactly is wrong with GraphQL ? It is better than REST for a number of use cases.
When a response is a representation of a single resource, that response has a definitive caching lifetime. When a response represents a synthesis of a bunch of random crap the user asked for, the response has no clear caching lifetime, and so is uncacheable.
Most of the scalability the web has achieved, rides on the back of response caching at one level or another. Even low-level business-backend vendor APIs can—and are often designed under the assumption and requirement of—being cached. GraphQL throws that property away.
GraphQL seems like a great idea if you're a massive social media company looking for ways to empower devs to stamp out flavor of the week social media widgets without being tied to existing APIs. I'd bet most companies doing GraphQL are just doing uncacheable things that REST APIs could do fine.
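To make the caching point concrete, here is a toy cache-key function in the spirit of what a shared reverse proxy does (not any real proxy's actual logic): a REST read keys cleanly on method + URL, while the typical GraphQL POST has no stable key or lifetime and just gets passed through.

```python
# Toy illustration only: how a shared cache might derive a key for a response.
def cache_key(method, url, body=None):
    if method == "GET":
        # One resource, one URL, one key: the response can be cached and reused.
        return (method, url)
    # Typical GraphQL traffic is a POST whose body is an arbitrary query shape,
    # so a generic cache has no safe key or lifetime and simply forwards it.
    return None

print(cache_key("GET", "/articles/42"))
print(cache_key("POST", "/graphql", '{"query": "{ article(id: 42) { title } }"}'))
```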
Almost all graphQL clients implement a custom caching layer (for example using indexeddb in the browser) that can cache resources with different timelines, also when they’re returned in a single response.
Yeah, but that's client-side caching, which doesn't get you much in terms of scalability. Scalability comes from transparent backend reverse-proxy caching — Varnish and the like. You can cache the resources that go into a GraphQL response, but you still have to build the GraphQL response each time — and that can become a problem, as various should-be-trivial parts of each request's backend lifecycle start living long enough during high concurrency request loads to destroy request-queue clearance rates.
Kubernetes is not like MapReduce. It does not need microservices at all. It is a scheduling and deployment framework, which you will implement yourself anyway (hopefully you do) or you use a PaaS. It’s not even that hard to work with. Of course it is complex, but a lot of these tools are, even the lower-level ones like Terraform.
Between monoliths and microservices you have services and sidecars. If you don’t at least have sidecars I really don’t see the point of kubernetes, because most of the rest of the services will follow Conway’s Law and can reasonably do their own thing for less than 125% of the cost of full bin packing.
Things like CertManager and ExternalDNS take away so much operational time dealing with those two things alone. There’s a lot of good infra automation. That being said, I’m a lot bigger on the Fly Machine, Compute-at-Edge trend. If only they had a good IaC solution (after abandoning Terraform and the lack of features in fly.toml).
I've worked on a lot of very large Kubernetes projects and none used sidecars widely.
The major benefit of Kubernetes for them was that you could use lots of cheap, ephemeral cloud instances to run your platform whilst still having high availability. It ended up saving a ridiculous amount of money.
Hundreds, in my case. We have a sidecar here or there, but essentially our entire operation is run in distroless containers that consume config maps. We have one source of truth for 5 baremetal cloud regions, a number of private on-prem cloud regions, our build and test infrastructure, and nearly everything else: it is our Argo repo and the auto-generated operator manifests from our operator mono-repo. We have a common client library that abstracts our CRDs into easy-to-consume functions, and in the end using Kubernetes as an API for infrastructure operations does exactly what it should: allows full consistency and visibility on configuration.
You did not even understand what k8s does. Trust me, you want something like Nomad or k8s; otherwise you will end up writing your own simple k8s-like thing anyway, and it will just be harder to understand because it's a home-grown solution. k8s even works at small scale; heck, at first it didn't even run in really big deployments.
Which problem is more serious? 1) your small company has an over-complex system that could have been postgres; 2) your medium-sized company has a postgres that's on fire at the bottom of the ocean every day despite the forty people you hired to stabilize postgres, and your scalable replacement system is still six months away?
#1 is more serious. #2 limits the growth of your already successful company. #1 sinks your struggling small business. You have to be successful to be a victim of your own success, after all. Not to mention the fact that #1 is way more common. Do you know how far Postgres scales? Because it's way past almost any medium-scale business.
Exactly. A lot of us work at #2 so we wish our predecessors saved us our current pain. But if they went that route we wouldn't be employed at that company because it wouldn't exist
Exactly, if a medium-sized company is struggling with Postgres, either they have very niche requirements or the scalability problems are in their own code.
What about #1b: you have an overly-complex "system", but most of that "system" is serverless (i.e. managed architecture that's Somebody Else's Problem), with your own business-logic only being exposed to a rather simple API?
I'm thinking here of engineering teams who, due to worries about scaling their query IOPS, turn not to running a Hadoop cluster, but rather to using something like Google's BigTable.
In the second scenario, they can't do math. They could have bought themselves 6-18 months by getting the most powerful machine available, for probably at most 1-2 salaries' worth of those 40 people.
Less than a single-digit percentage of workloads needs massive, hard-to-use horizontal scale-out (for things that could be solved on a single machine, or with a single database).
MR is useful as an ad hoc scheduler over data. Need to OCR 10k files? MR it.
Hadoop was the worst possible implementation of MR, wasted so much of everything. That was its primary strength.
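To make the "ad hoc scheduler over data" point above concrete, here is a minimal single-machine sketch of the same shape: fan a per-file task out to workers (the map), then fold the results together (the reduce). The directory name and the per-file stand-in work are made up for illustration; a real MR system distributes this across machines and handles failures for you.

```python
from multiprocessing import Pool
from pathlib import Path

def process_one(path):
    # Stand-in for the real per-file work (e.g. OCR); here we just count bytes.
    return path.name, path.stat().st_size

if __name__ == "__main__":
    files = list(Path("scans").glob("*"))      # hypothetical input directory
    with Pool() as pool:                       # the "map": fan work out to workers
        per_file = pool.map(process_one, files)
    total = sum(size for _, size in per_file)  # the "reduce": fold results together
    print(f"{len(per_file)} files, {total} bytes")
```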
Very early on in my enterprise career, in a continuance of a discussion where it was mentioned that our customer was contemplating a terabyte disk array (that would fill an entire server rack, so very fucking early) I learned about the great grandfather of NVME drives: battery backed RAM disks that cost $40k inflation adjusted.
“Why on earth would you spend the cost of a brand new sedan on a drive like this?” I asked. Answer: to put the Oracle or DB2 WAL data on so you could vertically scale your database just that much higher while you tried to solve the throughput problems you were having another way. It was either the bargaining phase of loss or a Hail Mary you could throw in to help a behind-schedule rearchitecture. Last resort vertical scaling.
Reminds me of when I had a 3-machine Hadoop cluster in my home lab and 2 nodes were turned off, but I was submitting jobs to it and getting results just fine.
I remember all the people pushing erasure code based distributed file systems pointing out how crazy it is to have three copies of something but Hadoop could run in a degraded condition without degraded performance.
I agree. I used Disco MR to do amazing things. Trivial to use, like anyone could be productive in under an hour.
Erasure codes are awesome, but so is just having 3 copies. When you have skin in the game, simplicity is the most important driver of good outcomes. Look at the dimensions that Netezza optimized: they saw a technological window and they took it. Right now we have workstations that can push 100GB/s from flash. We are talking about being able to sort 1TB of data in 20 seconds (from flash); the same machine could do it from RAM in 10.
I don't know where to put this comment so I'll put it here. DeWitt and Stonebraker are right, but also wrong. Everyone is talking past each other there. Both are geniuses, this essay wasn't super strong.
If I was their editor, I would say, reframe it as MapReduce is an implementation detail, we also need these other things for this to be usable by the masses. Their point about indexes proves my point about talking past each other. If you are scanning the data basically once, building an index is a waste.
No, plenty of tech debt is caused by over-engineering or prematurely optimizing for the wrong thing.
I'm not sure if the second outcome is meant to blame Postgres specifically on under-engineering in general, but neither seems to me like it should be a concern for an early-stage startup.
I have found that these fires become uncontrollable because of tech debt. While rarely the spark, it’s a latent fuel source.
It’s like our modern forests; unless something clears out the brush, we see wildfires start from the smallest spark. Once it starts, it’s almost impossible to do anything but try to limit the extent of the disaster.
Necessity is the mother of invention. MapReduce-based systems were developed because the state-of-the-art RDBMS systems of that age could not scale to the needs of the Googles/Yahoos/Facebooks during the phenomenal growth spurt of the early Web. The novelty here was the tradeoffs they made to scale out and up using the compute and storage footprints available at the time.
"We thought of that" vs "we built it and made it work".
MapReduce was never built to compete with RDBMS systems. It was built to compete with batch-scheduled distributed data processing, typically where there was no index. It was also built to build indices (the search index), not really to use them during any of the three phases. It was also built to be reliable in the face of cheap machines (bad RAM and disk).
Google built MR because it was in an existential crisis: they couldn't build a new index for the search engine, and freshness and size of the index was important for early search engines. The previous tools would crash part-way through due to the cheap hardware that Google bought. If Google had based search indexing on RDBMS, they would not exist today.
Now Google did use RDBMS- they used MySQL at scale. It wasn't unheard-of for mapreduces to run against MySQL (typically doing a query to get a bunch of records, and then mapping over those records).
I worked on a later MapReduce (long after it was mature) which used all sorts of tricks to extend the MapReduce paradigm as far as possible, but ultimately nearly everything got replaced with Flume, which is effectively a computational superset of what MR can do.
I think the paper must have been pulled because Stonebraker must have gotten huge pushback for attacking MR for something it wasn't good at. See the original paper for what they proposed as good use cases: counting word occurrences in a large corpus (far larger than the storage limits of Postgres and others at the time), distributed grep (without an index), counting unique items (where the number of items is larger than the capacity of a database at the time), reversing a graph (convert (source, target) pairs to (target, [source, source, source])), term vectors, inverted index (the original use case for building the index), and distributed sort. None of the RDBMS of that day could handle the scale of the web. That's all.
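For anyone who never saw the original model, the canonical word-count use case boils down to three phases: map, shuffle (group by key), and reduce. Here is a toy single-process sketch of that shape; the real system distributed each phase across thousands of cheap machines and restarted failed pieces.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Emit (word, 1) for every word in the document.
    for word in text.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group all values by key, as the framework's shuffle phase would.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    # Sum the per-document counts for each word.
    return word, sum(counts)

docs = {1: "the quick brown fox", 2: "the lazy dog and the fox"}
pairs = (pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text))
result = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
print(result)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}
```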
> Stonebraker must have gotten huge pushback for attacking MR for something it wasn't good at
I like this comment because it gets to the heart of a misunderstanding. I'd further correct it to say "for something it wasn't trying to be good at". DeWitt and Stonebraker just didn't understand why anyone would want this, and I can see why: change was coming faster than it ever did, from many angles. Let's travel back in time to see why:
The decade after mapreduce appeared - when I came of age as a programmer - was a fascinating time of change:
The backdrop is the post-dotcom bubble when the hype cycle came to a close, and the web market somewhat consolidated in a smaller set of winners who now were more proven and ready to go all in on a new business model that elevates doing business on the web above all else, in a way that would truly threaten brick and mortar.
Alongside that we have CPU manufacturers struggling with escalating clock speeds and jamming more transistors into a single die to keep up with Moore's law and consumer demand, which leads to the first commodity dual and multi core CPUs.
But I remember that most non-scientific software just couldn't make use of multiple CPUs or cores effectively yet. So we were ripe for a programming model that engineers who'd never heard of Lamport could actually understand and work with: threads and locks and socket programming in C and C++ were a rough proposition, and MPI was certainly a thing, but the scientific computing people who were working on supercomputers, grids, and Beowulf clusters were not the same people as the dotcom engineers using commodity hardware.
Companies pushing these boundaries were wanting to do things that traditional DBMSes could not offer at a certain scale, at least for cheap enough. The RDBMS vendors and priesthood were defending that it's hard to offer that while also offering ACID and everything else a database offers, which was certainly not wrong: it's hard to support an OLAP use case with the OLTP-style System-R-ish design that dominated the market in those days. This was some of the most complicated and sophisticated software ever made, imbued with magiclike qualities from decades of academic research hardened by years of industrial use.
Then there was data warehouse style solutions that were "appliances" that were locked into a specific and expensive combination of hardware and software optimized to work well together and also to extract millions and billions of dollars from the fortune 500s that could afford them.
So the ethos at the booming post-dotcoms definitely was "do we really need all this crap that's getting in our way?", and we would soon find out. Couching it in formalism and calling it "mapreduce" made it sound fancier than what it really was: some glue that made it easy for engineers to declaratively define how to split work into chunks, shuffle them around and assemble them again across many computers, without having to worry about the pedestrian details of the glue in between. A corporate drone didn't have to understand /how/ it worked, just how to fill in the blanks for each step properly: a much more viable proposition than thousands of engineers writing software together that involves finicky locks and semaphores.
The DBMS crowd thumbed their noses at this because it was truly SO primitive and wasteful compared to the sophisticated mechanisms built to preserve efficiency that dated back to the 70s: indexes, access patterns, query optimizers, optimized storage layouts. What they didn't get was that every million dollar you didn't waste on what was essentially the space shuttle of computer software - fabulously expensive and complicated - could now buy a /lot/ more cheapo computing power duct taped together. The question was how to leverage that. Plus, with things changing at the pace that they did back then, last year's CPU could be obsolete by next year, so how well could the vendors building custom hardware even keep up with that, after you paid them their hefty fees? The value proposition was "it's so basic that it will run on anything, and it's future proof" - the democratization aspect could be hard to understand for an observer at that point, because the tidal wave hadn't hit yet.
What came was the start of a transition from datacenters to rack mounts in colos, and from dedicated hosts to virtualization and, very soon after, the first programmable commodity clouds: why settle for an administered unixlike timesharing environment when you can manage everything yourself and don't have to ask for permission? Why deal with buying and maintaining hardware? This lowered the barrier for smaller companies and startups who previously couldn't afford access to such things, nor to the markets that required them, which unleashed what can only be described as a hunger for anything that could leverage that model.
So it's not so much that worse was better, but that worse was briefly more appropriate for the times. "Do we really need all this crap that's getting in our way?" really took hold for a moment, and programmers were willing to dump anything and everything that was previously sacred if they thought it'd buy them scalability: schemas and complex queries, to start.
Soon after, people started figuring out how to maintain all the benefits they'd gained (democratized massively parallel commodity computing) while bringing back some of the good stuff from the past. Only 2 years later, Google itself published the BigTable paper where it described a more sophisticated storage mechanism which optimized accesses better, and admittedly was tailored for a different use case, but could work in conjunction with mapreduce. Academia and the VLDB / CIDR crowd was more interested now.
Some years after that came out the papers for F1 and Spanner, which added back in a SQL-like query engine, transactions, secondary indexes etc on top of a similar distributed model in the context of WAN-distributed datacenters. Everyone preached the end of nosql and document databases, whitepapers were written about "newsql", frustrated veterans complained about yet another fad cycle where what was old was new again.
Of course that's not what happened: the story here was how a software paradigm failed to adapt to the changing hardware climate and business needs, so capitalism ripped its guts apart and slowly reassembled them in a more context-applicable way. Instead of storage engines we got so many things it's hard to keep up with, but LevelDB comes to mind as an ancestor. With locks we got Chubby and ZooKeeper. With log structures we got Kafka and its ilk. With query optimizer engines we got Presto. With in-memory storage we got Arrow. We got a cambrian explosion of all kinds of variations and combinations of these, but eventually the market started to settle again and now we're in a new generation of "no, really, our product can do it all". It's the lifecycle of unbundling and rebundling. It will happen again. Very curious what will come next.
It's worth noting, in addition to looking down upon MapReduce, Stonebraker and the academic crowd were equally unimpressed by other commodity hardware scale-out practices of the time – including the practical-minded RDBMS sharding and caching strategies used by all the booming massive-scale startups.
In 2011, in an interview with GigaOm, Stonebraker famously called Facebook's use of sharded MySQL and Memcached "a fate worse than death" [1]. He also claimed the company should redesign their entire infrastructure, seemingly without clarifying what specific problem he thought needed to be solved.
The reader comments on that post are also quite interesting in tone.
Edit to add a disclosure: I joined Facebook's MySQL team a couple years after this, and quite enjoyed their database architecture, which certainly colors my opinion of this topic.
Thanks! :-) It's been brewing in my head for a while and I've written out shorter snippets of similar ideas over the course of the past few years.
I've taken some liberties and likely made some mistakes, but it's aping the kind of "history of technology in its social context" that I love to read about.
My recollection of the time is that lots of people thought they needed to use MapReduce for their "big data" but their data was like 100GB of logs they wanted to run a O(N) analysis on.
I wonder whether they pulled their article because they changed their mind. Maybe over time they got to understand the difference in use cases and tradeoffs. It's definitely true that plain MapReduce can be shockingly inefficient, but systems like Spark, which add rich data structures and in-memory caching, go a long way towards closing that gap. It's still not a panacea, and I've definitely seen use cases where SQLite could replace and far outperform Spark, but the general programming model behind Spark remains a powerful and widely applicable abstraction.
My main complaint with Spark is that it's pretty hard for non-experts to debug crashes and failures. While the programming model beautifully abstracts away all the complexities of parallel and distributed programming with its headaches of managing clusters and dealing with synchronization and communication, the entire abstraction bursts instantly when debugging is needed. Suddenly you need to understand the entire Spark programming model from top-to-bottom, from DataFrames to RDDs and partitions, along with implementation details like shuffle files, DAGScheduler, block caching, Python-JVM interop, OOM killer, and on and on.
And there's absolutely no incentive for the commercial vendors to improve this situation in the open source project. (This isn't specific to Spark though, the entire Hadoop ecosystem seems to operate on hidden complexity in the open source version with vendors who make money by providing support and more usable distributions.)
I'll be honest, the performance issue is almost never the language used.
You can write Python code that's very fast for almost any use case (using numpy/pandas/vaex/polars/etc). Similarly for Java.
The main thing is the way people use Python/Java is generally extremely inefficient - the code is full of cache misses because of how the idiomatic way to use those languages conflicts with how CPU caches work.
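As a rough illustration of that point (a sketch, assuming NumPy is installed; exact numbers will vary by machine): the same arithmetic done as a Python-level loop over boxed floats versus one vectorized call over a contiguous array.

```python
import time
import numpy as np

n = 5_000_000
xs = np.random.rand(n)

# Python-level loop: every element is boxed and dispatched one at a time.
start = time.perf_counter()
total = 0.0
for x in xs:
    total += x * x
loop_s = time.perf_counter() - start

# Vectorized: one call over a contiguous buffer, cache- and SIMD-friendly.
start = time.perf_counter()
total_vec = float(np.dot(xs, xs))
vec_s = time.perf_counter() - start

print(f"loop: {loop_s:.2f}s  vectorized: {vec_s:.4f}s")
```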
Is Spark still used anywhere? I haven't heard about it in a loooong time. At the time it seemed like a very nice abstraction, but while I played around with it, I never had a problem that needed such a complex solution.
Spark probably powers 90% of FANG batch data pipelines. There are still lots of Hive and Hadoop clusters out there that are just being maintained for reporting until they get funding to swap to some fancy DBT setup.
Spark is pretty much ubiquitous and the default solution for batch processing right now (especially pyspark). There’s also a lot of users of AWS Glue with spark.
PySpark instead of spark, but I had a job a couple years back using it in glue to generate financial reports. No longer on the project, but I'm pretty sure they're using it.
Honestly wasn't that bad of a model. But, then again, the job didn't actually need spark, someone just sold it that way before I was on the project. Fun to work with though
Isn't the Hadoop story also playing out with Spark? The open source version is kinda buggy, while the Spark committers at Databricks have retained all the fixes for themselves.
Interesting that all of the successors to MapReduce fix all the major criticism DeWitt and Stonebraker have. (I'm thinking BigQuery, Snowflake, Spark) So the lessons were re-learned and features re-introduced, but the massively parallel execution has persisted.
Perhaps MapReduce was non-novel and flawed, but it certainly seems to have led to a flowering of rich, large scale data querying systems.
Some of these issues were solved by Spark. Do agree with the overall point, people shouldn’t be reaching for Hadoop when Postgres would suffice. Indexes are fast. Use them.
Indexes are fast when they're built well and used often. Indexes are expensive (and paid for in triplicate via backup costs) when they are seldom or never used. Sometimes you just need to materialize a table temporarily, which of course you can do in the RDBMS as well, but sometimes the data sources are so scattered (or also ephemeral) that keeping all processing inside the DB system is a stretch.
But perhaps the most compelling justification is based on the team's familiarity with the DB system. Not everyone has the same level of SQL expertise, and some find the visualization tools added to MapReduce systems, and the source language itself, more approachable than the output of an EXPLAIN statement, especially if the same pipeline would effectively be hundreds of lines of SQL.
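A minimal, self-contained illustration of the "indexes are fast, and EXPLAIN tells you whether they're used" point, using SQLite only because it ships with Python; the table name and query here are made up, and the same idea carries over to Postgres's EXPLAIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, ts INTEGER, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [(i % 100, i, "x") for i in range(10_000)])

# Before the index: the planner typically reports a full table scan.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT count(*) FROM events WHERE user_id = 42").fetchall())

conn.execute("CREATE INDEX idx_events_user ON events(user_id)")

# After the index: the same query is typically answered via an index search.
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT count(*) FROM events WHERE user_id = 42").fetchall())
```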
> people shouldn’t be reaching for Hadoop when Postgres would suffice. Indexes are fast. Use them.
let me try a riskier one: "people should not be reaching for kubernetes when a system administrator would suffice. sysadmins are cheap(er?) than clouds, use them"
It did not come out very well... I got stuck trying to find what to contrast kube with; all I got in the minutes allotted to comment posting was 'system administrator'. Meh.
Everyone wants to use k8s and like 1% need it in any shape or form. It basically acts as a conspiracy between otherwise-redundant ops-level people and those running Kubernetes, against the companies who have to pay for all this. Just use Fargate, and be done.
The objection I have to this paper is that I keep seeing SQL forced onto streaming workflows when the functional programming operations provided by the non-SQL APIs of systems like Flink and Spark are a lot easier to think about in this context: SQL stream joins often have very surprising performance properties and it can take a long time to figure out the tunables needed to make them perform at scale.
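For contrast, here is the functional style that comment is talking about: a minimal PySpark sketch (batch rather than streaming, and assuming a local Spark install) where the shape of a keyed aggregation, and therefore its shuffle, is explicit in the code instead of hidden behind a SQL planner. The data and names are illustrative only.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("functional-style").getOrCreate()

# A keyed aggregation written as explicit functional steps: you can see exactly
# where the shuffle happens (reduceByKey) and what data moves across the wire.
events = spark.sparkContext.parallelize([
    ("user_a", 3), ("user_b", 5), ("user_a", 2), ("user_b", 1),
])
totals = events.reduceByKey(lambda a, b: a + b)

print(dict(totals.collect()))  # {'user_a': 5, 'user_b': 6}
spark.stop()
```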
MapReduce is a good example of choosing tools suitable to the task. I'll agree with the authors that there are certainly large data workloads for which a traditional RDBMS is better suited or more optimal.
But the designers and avid users of MapReduce did not use it because it seemed a more optimal DB query engine. The utility/cost being optimized was not compute cycles or disk seeks or log appends; it was the developer time needed to construct another large-scale OLAP definition, and an overall tolerance to hardware and system failure.
Having been an early adopter (2007) and ridden the whole wave, I think the entire movement ignited by MapReduce, BigTable, etc. was probably one of the best things that happened in the industry.
For me personally, it allowed me to break things down to first principles in a way that "industry coding" wouldn't have. It was the practical side of theory that had been confined to school.
There were, however, two types of big data adopters. I was in the bottom-up camp, where passion for learning and finding the best solution to the problem was the driver. The top-down camp that eventually filled the Hadoop conferences by the time they got large (>1000 people) I suspect didn't get much out of it, neither for their organizations nor personally.
So back to Stonebraker: back then, same as now, this looks like frustration more than anything. I do understand where it comes from, but it's still frustration more than anything.
Relational algebra is nice, but classical databases and SQL never nailed either the theory or the practice. NoSQL for me was more NoOracle, NoMSSQL, etc., and an ability to learn by doing from the ground up.
I was at Yahoo during the time of Hadoop/MapReduce. I'd summarize the value of MapReduce in one sentence as: a solution you can use is infinitely better than a solution you can't. An optimized DB might be much better for a specific workload. To get an optimized DB would take 6 months of conversations with a half dozen teams and getting approval for hardware. Then you'd use it for a few days before moving on to your new workload. Then get yelled at for the waste of resources. To use MapReduce you logged into a cluster and ran your code the same day. MapReduce wasn't replacing databases. It was replacing local scripts running on database dumps and dozen-machine clusters cobbled together to improve throughput. It's clear from the backgrounds of the people who wrote that piece that they never had to be in the shoes of the people actually using MapReduce on a daily basis and getting value from it.
edit: This is roughly the same reason cloud took off. Cloud costs more, but waiting 6 months for IT to deploy a half-baked compromise solution is significantly more expensive for the business in lost opportunity and productivity.
The note seems oddly reductionist or strawman-indulging. Nothing about MR says you can't use indices. In a sense, MR is really about sharding!
It would be interesting to know how often MR (and its line) has been used on datasets, and execution hardware, which didn't suit standard SQL approaches. But it's also fundamentally different targets (again, both data and hardware). You probably shouldn't be doing MR-like approaches on less than dozens of (commodity, local-disk) nodes, and I think that's a pretty extreme tail for standard monolithic SQL...
This reads as though they had taken issue with some of the movement away from traditional relational databases in that era. I know it's been retracted, so who knows what the authors opinions are today, but the eureka moment to me was in horizontal scalability - not the map and reduce functions. Until then if you wanted something to go faster you simply bought a bigger computer.
In my work many of the MR use cases didn't necessarily compete with activities you'd expect of a database. For example - take the entire Geo database and generate the map tiles. I also must have written hundreds of MRs that essentially transform the entire dataset. You certainly can do that in a relational database but it seems like the wrong approach. Similarly if I was going to perform a bunch of queries I wouldn't write a new MR each time.
This was not my first programming job and I didn't mean to imply stupidity. Maybe you could give some examples?
I had seen it in some contexts: distcc, networking, seti@home. At least to me it didn't seem as common a tactic as trying to scale up the hardware on the machine or designing a different algorithm. That's generally what I saw at Microsoft in the years before.
Maybe it was more the power of combining it with a "cloud" and containers - being able to change just the value of --mapreduce_machines= to something like 10000 and seeing it happen. That's the first time I had been exposed to that.
I was an Apache committer/contributor to a few of the Hadoop ecosystem projects and found this paper to be very inspiring. I totally agree with much of it and it really changed the way I look at the world. I would go further nowadays, especially now that databases are so smart.
It was never about no-schema it was about schema-on-read.
And it came about because companies were wasting millions doing pointless up front schema design for data that were never queried. As well as the ongoing challenge of trying to keep schemas up to date for systems they don't control e.g. SaaS.
I am responsible for a MapReduce-based system built about 13 years ago and it is the bane of my team’s existence right now. It was the hotness in its day, but did not age well at all. We are working on a replacement using ClickHouse.
Distributed SQL backends such as BigQuery and Spanner basically use MapReduce underneath anyway. As do most other distributed backends, SQL or not, capable of aggregation between shards.
Yes, stuff that splits up, shuffles, and combines the result is exactly what I mean. That’s MapReduce. It’s just that e.g. in Dremel/BigQuery the shuffle step happens via a special in-memory distributed system, but the fundamental idea is the same: do what you can on the node, partition/shuffle, reduce on the receiving nodes, sometimes more than once.
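To illustrate that claim, here is a toy decomposition of a distributed SUM ... GROUP BY into exactly those steps, with "nodes" simulated as plain Python lists. This is a sketch of the idea only, not how any particular engine implements it.

```python
from collections import defaultdict

# Each "node" holds a shard of (key, value) rows.
node_shards = [
    [("de", 10), ("us", 3)],
    [("us", 7), ("de", 1), ("fr", 4)],
]

# 1. Map: each node pre-aggregates locally (a partial SUM ... GROUP BY).
def local_agg(rows):
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return partial

partials = [local_agg(shard) for shard in node_shards]

# 2. Shuffle: route each key to the node responsible for it (here: hash % 2).
routed = defaultdict(list)
for partial in partials:
    for key, value in partial.items():
        routed[hash(key) % 2].append((key, value))

# 3. Reduce: the receiving nodes merge the partials into final totals.
final = {}
for rows in routed.values():
    final.update(local_agg(rows))

print(final)  # {'de': 11, 'us': 10, 'fr': 4}
```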
I think that in academic work this is reasonable. People sometimes claim to have done something novel, when it is clearly a repetition of something older. Perhaps the citations of the works being criticized didn't make it clear that they were repeating an old idea. It allows the reader to see that there were relevant previous works that may have been subjected to criticism in the past and that those previous criticisms remain important for the more recent work. The question then is whether the recent work resolves those criticisms. The authors in this case seem not to think so.
A better paraphrase is the following: The new idea is not so new as some claim. It is 20 years old and was previously shown to be flawed in comparison to other approaches. Looking over 40 years of literature, one can see the flaws in the approach and how the flaws were subsequently resolved. Proponents of the "new idea" are either unaware of these advancements or are ignoring them.
I'd say that ignoring previous literature and work is a big problem in CS-adjacent fields. My experience is being a theoretical linguist who interfaces with computational linguists. I have had colleagues receive criticism for citing work that is 10 years old or more, even if such a work represents the earliest example of a particular idea they are making use of. It is suggested that a more modern work should be cited. There is a kind of "anti-memory" culture that results from trying to make work seem cutting edge, even if a work is clearly an extension or reinvention of very old ideas.
I also agree we should stop using MapReduce because of its lack of support for Crystal Reports. /s
Kidding aside it's easy to bash a paper like this with the benefit of hindsight and move on. But instead look at what we can take from it.
The benefit of new tech isn't always being better than old tech at its strengths; it can also be being better than old tech at its weaknesses, and finding a way to change the problem domain so that those strengths don't matter much.
100% agree. The MapReduce hype always seemed strange to me because it's basically the Volcano paper from the 90s, but with custom user-defined operators instead of pre-baked ones in a more traditional engine.
To make everything worse, Hadoop came along, ignoring every industry advance of the past 40 years with its "one tuple at a time" iterator-based model on a garbage-collected language. I realize it's very easy for me to say those things in hindsight, but it's not like vectorized execution was a weird obscure secret by the time these things came out.
On a side note, it finally looks like the industry is moving towards saner tools that implement a lot of the things this article mentions MapReduce was missing.
How good is your vectorized execution engine at dealing with a handful of storage nodes going down for an hour or two? Or figuring out when bit flips have randomly happened? Or at sharing resources with latency-sensitive serving jobs?
"Custom user defined operators" did a lot of heavy lifting at Google over decades.
The set of appropriate use cases is getting smaller, mostly due to sql-ish systems scaling up towards what could actually (and frequently, only) be done with MapReduce/Flume.