Trivial example of what people need to know in enterprises:
“Tell me whether pet rocks are selling better than Barbie dolls in the south?”
What people really need to know:
I already know from existing reporting that 984 orders (18% of our backlog) are already past due. For those 984 orders:
- How many are for one item and how many are for multiples?
- Do we own what we owe those customers?
- If we do own it, is it in the proper warehouse?
- If it is in the proper warehouse, can we find it?
- If we can find it, is it undamaged and certified?
- If it's shippable, do we have enough labor to ship it?
- If it isn't certified, how soon can QA certify it?
- If it isn't in the right warehouse, can we move it?
- If we don't own any, where can we get some?
- Which vendors have it on the shelf?
- Which vendors do we have blanket purchase orders with?
- Which vendors do we have contracts with?
- Which orders can be split to satisfy a partial?
- Which orders are for customers already on credit hold?
- Which customers are threatening not to renew with us?
and (ironically) the most asked question of all:
- Which orders must be shipped to hit our quarterly numbers?
I can go on and on; this is just off the top of my head. We like to pick on enterprises, but this is the shit that really happens all the time. So whenever you get gas in your car, bread on your table, new shoes at the mall, steamed milk in your latte, etc., etc., etc., rest assured that someone, somewhere has asked these questions. Questions that were probably answered using some form of RDBMS, SQL, ACID technology (with really good application software on top of it).
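To make the drill-down concrete, here is a minimal sketch of the first three questions asked as ad-hoc SQL, using Python's built-in sqlite3 module. The tables, columns, and rows are all invented for illustration, and the stock check is deliberately naive: it does not allocate inventory between competing orders.

```python
import sqlite3

# Hypothetical schema and sample rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, item TEXT, qty INTEGER,
                     ship_from TEXT, past_due INTEGER);
CREATE TABLE stock (item TEXT, warehouse TEXT, on_hand INTEGER);
INSERT INTO orders VALUES
  (1, 'widget', 1, 'east', 1),
  (2, 'widget', 5, 'west', 1),
  (3, 'gadget', 2, 'east', 1),
  (4, 'gizmo',  1, 'east', 1),
  (5, 'widget', 1, 'east', 0);
INSERT INTO stock VALUES
  ('widget', 'east', 10),
  ('gadget', 'west', 4);
""")

# How many past-due orders are for one item and how many are for multiples?
single, multiple = conn.execute("""
    SELECT SUM(CASE WHEN qty = 1 THEN 1 ELSE 0 END),
           SUM(CASE WHEN qty > 1 THEN 1 ELSE 0 END)
    FROM orders
    WHERE past_due = 1
""").fetchone()

# Do we own what we owe, and if so, is it in the proper warehouse?
# Naive check: each order is compared against total stock independently.
status = dict(conn.execute("""
    SELECT o.order_id,
           CASE
             WHEN COALESCE(SUM(s.on_hand), 0) < o.qty THEN 'not owned'
             WHEN COALESCE(SUM(CASE WHEN s.warehouse = o.ship_from
                                    THEN s.on_hand ELSE 0 END), 0) >= o.qty
               THEN 'in place'
             ELSE 'wrong warehouse'
           END AS status
    FROM orders o
    LEFT JOIN stock s ON s.item = o.item
    WHERE o.past_due = 1
    GROUP BY o.order_id, o.qty
    ORDER BY o.order_id
""").fetchall())

print(single, multiple)
print(status)
```

Each follow-up question in the list is another join or another CASE branch on top of this, which is the kind of incremental, unplanned querying the list describes.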
I believe MongoDB in particular can be a fairly good solution for building data warehouses (I'm starting to use it for reporting systems).
One great point about MongoDB is that it makes the ETL process a lot easier (you don't have to prepare tables with the right schema and it supports large amounts of data).
I wouldn't be surprised to see some NoSQL solutions get wider adoption in the enterprise, either alone or with tools that build upon them.
As for the article: it's pure linkbait in my opinion!
I used to be a Business Intelligence consultant for enterprises. We built reports, data warehouses, dashboards, etc. From my experience, the article is spot on, not linkbait.
Maybe MongoDB is better once you have a well-defined query that you need, but I think the point of the parent comment is that those example queries are ad hoc. NoSQL is not as good as SQL when it comes to report specs that are constantly in a state of flux.
I need my data available to answer questions. When building a product you have a well defined set of operations based on the features of your product. When the requirements shift on a regular basis, NoSQL is too limiting.
When the article talks about a low-level query language being too limiting, it is talking about missing things like CONNECT BY PRIOR or SUM(CASE WHEN col IN ('a','b','c') THEN 1 ELSE 0 END). These are the same kinds of things that are difficult to do with an ORM.
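For readers who have not met these two constructs, here is a small sketch using Python's built-in sqlite3 module. CONNECT BY PRIOR is Oracle syntax and SQLite does not have it, so the hierarchy walk below uses a recursive common table expression instead, which is the standard-SQL way to get a comparable result. The tables and rows are made up for the example.

```python
import sqlite3

# Hypothetical tables and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, col TEXT);
INSERT INTO sales VALUES
  ('south', 'a'), ('south', 'b'), ('south', 'z'),
  ('north', 'c'), ('north', 'z');
CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
INSERT INTO staff VALUES
  (1, 'ceo', NULL), (2, 'vp', 1), (3, 'analyst', 2), (4, 'clerk', 2);
""")

# Conditional aggregation: count only the rows whose col is a, b, or c.
counts = conn.execute("""
    SELECT region,
           SUM(CASE WHEN col IN ('a', 'b', 'c') THEN 1 ELSE 0 END)
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()

# Hierarchy walk with a recursive CTE (stand-in for CONNECT BY PRIOR).
tree = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        SELECT id, name, 0 FROM staff WHERE manager_id IS NULL
        UNION ALL
        SELECT s.id, s.name, c.depth + 1
        FROM staff s
        JOIN chain c ON s.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth, id
""").fetchall()

print(counts)
print(tree)
```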
I am currently doing reports/data warehouses/dashboards. When something more complicated than simple questions is needed (see Data Warehousing for Cavemen), ad-hoc queries are quite often not the answer anymore, either with NoSQL or with SQL.
I don't want my clients to be dependent on me (or someone else) to build complicated SQL queries when they have questions, so I focus on getting an easy-to-maintain facts/dimensions model (as advocated by Ralph Kimball http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimens...) which can evolve if needed.
The nice point about MongoDB when doing this is that it makes it a lot easier to add attributes to dimensions, or load the data, or evolve the reporting system in general (and I like that).
You can apply the same principles to build dimensions/facts based data structure and answer questions that SQL alone wouldn't be able to answer easily.
Example of such a question: how many calls did we receive during French legal week #9 that were handled by team X outside normal working hours or while we were on vacation? Of those calls, how many came from a woman (as it has a financial impact in this case)?
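A rough sketch of how that question looks once the awkward parts (legal week, working hours, vacation days) are precomputed as dimension attributes. Plain Python dicts stand in for the documents here; this is not MongoDB's API, and every field name and value is hypothetical.

```python
# Hypothetical dimension documents, keyed by their identifiers.
date_dim = {
    20100301: {"legal_week": 9, "on_vacation": False},
    20100302: {"legal_week": 9, "on_vacation": True},
    20100308: {"legal_week": 10, "on_vacation": False},
}
hour_dim = {
    9: {"working_hours": True},
    22: {"working_hours": False},
}
caller_dim = {
    1: {"gender": "F"},
    2: {"gender": "M"},
}

# Hypothetical fact documents: one per call, holding only keys into dimensions.
calls = [
    {"date": 20100301, "hour": 22, "team": "X", "caller": 1},
    {"date": 20100301, "hour": 9, "team": "X", "caller": 2},
    {"date": 20100302, "hour": 9, "team": "X", "caller": 2},
    {"date": 20100308, "hour": 22, "team": "X", "caller": 1},
    {"date": 20100301, "hour": 22, "team": "Y", "caller": 1},
]


def off_hours(call):
    """True if the call came in outside working hours or on a vacation day."""
    day = date_dim[call["date"]]
    hour = hour_dim[call["hour"]]
    return day["on_vacation"] or not hour["working_hours"]


# Calls in legal week 9, handled by team X, outside normal availability.
selected = [
    c for c in calls
    if c["team"] == "X"
    and date_dim[c["date"]]["legal_week"] == 9
    and off_hours(c)
]
# Of those, how many came from a woman?
from_women = sum(1 for c in selected if caller_dim[c["caller"]]["gender"] == "F")

print(len(selected), from_women)
```

The business rules live in the dimension attributes, so adding a new one (say, a public-holiday flag) means adding a field to the date documents rather than rewriting the question.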
I am not a MongoDB expert by any stretch, so please correct me where I am wrong.
The way I read your link it sounds like I need to store the data in a particular way in order to run a parent/child query. That is great if I know that I need that query at design time. What happens if I have tens of millions of records and need to run that report on an ad-hoc basis? Where the relationship may or may not be important?
What if I want to sum the "cost of goods sold" one day and the "items per transaction" the next? Does that not require someone to write code more complex than SQL? Because on Oracle a business analyst can open up Toad and run that query.
If I am wrong then it very well may be that the problem is one of the enterprise not being aware.
Ease of use is subjective. I don't think writing a MapReduce job needs to be more complex than writing an equivalent SQL query. What really matters is elegance, flexibility and power.
My personal experience is that the MongoDB model seems to win in most cases, especially when it comes to flexibility and ad-hoc querying. Having a real language (JavaScript) and a flexible schema tends to make most business problems easier to express.
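For a side-by-side feel, here is a sketch of the two ad-hoc questions from the parent (cost of goods sold, items per transaction) written once as SQL and once in map/reduce style. The map_reduce helper is a toy, in-process stand-in written for this example, not MongoDB's mapReduce command, and the line items are invented.

```python
import sqlite3
from collections import defaultdict

# Hypothetical line items: (transaction_id, item, qty, unit_cost).
rows = [
    (1, "pet rock", 2, 1.50),
    (1, "doll", 1, 4.00),
    (2, "pet rock", 3, 1.50),
]

# SQL version, using an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE line_items (txn INTEGER, item TEXT, qty INTEGER, unit_cost REAL)"
)
conn.executemany("INSERT INTO line_items VALUES (?, ?, ?, ?)", rows)

sql_cogs = conn.execute("SELECT SUM(qty * unit_cost) FROM line_items").fetchone()[0]
sql_items = dict(
    conn.execute("SELECT txn, SUM(qty) FROM line_items GROUP BY txn").fetchall()
)


def map_reduce(records, mapper, reducer):
    """Toy map/reduce: group emitted (key, value) pairs, then reduce per key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Map/reduce version of the same two questions.
mr_cogs = map_reduce(
    rows,
    lambda r: [("cogs", r[2] * r[3])],
    lambda key, values: sum(values),
)["cogs"]
mr_items = map_reduce(
    rows,
    lambda r: [(r[0], r[2])],
    lambda key, values: sum(values),
)

print(sql_cogs, mr_cogs)
print(sql_items, mr_items)
```

Both versions are short at this size. Whether a business analyst would be as comfortable writing the second form as the first is the part that stays subjective.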
> Ease of use is subjective. I don't think writing a MapReduce job needs to be more complex than writing an equivalent SQL query. What really matters is elegance, flexibility and power.
Unfortunately it seems you have completely misunderstood the nature of both SQL and MapReduce. MapReduce is a distributed computation engine. While it can be used in that way it was never meant to be a database system. BigTable is proof enough of that.
In general, SQL is the syntactical representation of relational algebra with some hacky additions for programmer convenience. Comparing just "SQL" to the MongoDB language model is misguided since you then break down to a question of algebraic expressivity and relational power.
I'm not going to try to build a proof here, but we do know that a formal relational algebra system is equivalent to first-order logic. As far as MongoDB's query language goes, one would probably have to make an argument that it is equivalent to either tuple or domain relational calculus, but I know of no theoretical work that has attempted this. If anyone has any more information on the theoretical expressiveness of the MongoDB query model I would love to read it.
I was not arguing about relational algebra or theoretical expressivity or logical equivalence or anything like that. I was simply stating that in practice most business problems are easier to model and more flexible to query in the MongoDB model.
Of course you need some time to get used to thinking in terms of documents rather than tables and rows. But once you get used to the idea you can easily model most domains that occur in practice.
> MapReduce is a distributed computation engine. While it can be used in that way it was never meant to be a database system. BigTable is proof enough of that.
Yes, MapReduce in the Google and Hadoop sense is designed for massive batch processing. That's why BigTable and HBase exist. MapReduce in the CouchDB and MongoDB sense is a Turing-complete query and processing layer built on top of a document store. In the CouchDB case MapReduce is the only way you can query the database.
That's why data warehouses rely upon cubes/OLAP for analysis. It is a specialized solution that serves the need very well.
> One great point about MongoDB is that it makes the ETL process a lot easier (you don't have to prepare tables with the right schema and it supports large amounts of data).
So does a CSV. In fact, so does the last silver bullet, which is XML. XML is as loose or as strict as you want it to be.
OLAP on SQL comes at a cost, too (which is why some people are diving into analytics and reporting with NoSQL tools, where you can add one server without large expenses).
On CSV/XML, my point wasn't clear enough: I wanted to underline the fact that it's a lot easier to load dimension data, then load fact data and resolve the foreign-key lookups when working with MongoDB (it's not about the file format, it's about the loading/lookup part, which is a large part of ETL in my cases).
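To show what the loading/lookup part means, here is a minimal sketch in plain Python: dimensions are loaded first and get surrogate keys, then each fact row has its natural keys swapped for those surrogate keys. The function names, field names, and data are all hypothetical, and dicts stand in for whatever store is actually used.

```python
def load_dimension(raw_rows, natural_key):
    """Assign surrogate keys; return (dimension table, natural -> surrogate lookup)."""
    dimension, lookup = {}, {}
    for row in raw_rows:
        value = row[natural_key]
        if value not in lookup:  # first occurrence wins, duplicates are skipped
            surrogate = len(lookup) + 1
            lookup[value] = surrogate
            dimension[surrogate] = dict(row)
    return dimension, lookup


def load_facts(raw_facts, lookups):
    """Replace natural keys with surrogate keys; set aside rows that fail lookup."""
    loaded, rejected = [], []
    for fact in raw_facts:
        try:
            keys = {
                f"{name}_key": lookup[fact[name]]
                for name, lookup in lookups.items()
            }
        except KeyError:
            rejected.append(fact)  # unknown dimension member
            continue
        measures = {k: v for k, v in fact.items() if k not in lookups}
        loaded.append({**keys, **measures})
    return loaded, rejected


# Hypothetical source extracts.
teams_raw = [{"team": "X", "site": "Paris"}, {"team": "Y", "site": "Lyon"}]
callers_raw = [{"caller": "c-17", "gender": "F"}]
facts_raw = [
    {"team": "X", "caller": "c-17", "duration_s": 120},
    {"team": "Z", "caller": "c-17", "duration_s": 30},  # team Z is unknown
]

team_dim, team_lookup = load_dimension(teams_raw, "team")
caller_dim, caller_lookup = load_dimension(callers_raw, "caller")
facts, rejected = load_facts(
    facts_raw, {"team": team_lookup, "caller": caller_lookup}
)

print(facts)
print(rejected)
```

With a schemaless store, adding an attribute to a dimension is just another field on the row dicts, with no table definition to alter first.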
For the people who need OLAP analysis, the relevant expense range is seldom that much of a consideration. I'm looking at a storage system right now that costs $800,000. It's considered mid-range and is merely the starter system.
Note that OLAP is, in many regards, NoSQL. It is really the most successful variant of NoSQL.
However I'm very curious what sort of analytics people are doing with NoSQL. I have seen people essentially generating reports to MongoDB, for instance, but I have never seen anything remotely approaching flexible analytics on such a system.
A data warehouse is a separate RDBMS installation that contains copies of data from on-line systems. A data warehouse would not be necessary if RDBMS software worked as advertised. It is merely a $10 million bandaid applied to the limitations of modern computers and RDBMS software.
If you want to do more computation, you require more computers, and this is "insight"?
You might as well say "A level 2 cache would not be necessary if RAM worked as advertised". Duh!
Having spent an unpleasantly large part of my life enmeshed in it, I strongly disagree with the last statement "with really good application software on top of it."