Trivial example of what people need to know in enterprises: *“Tell me whether pe...

thibaut_barrere · on Sept 30, 2010

SQL is actually quite often a bad way to try to answer those questions, too! See http://philip.greenspun.com/wtr/data-warehousing.html for an entertaining explanation.

I believe MongoDB in particular can be a fairly good solution for building data warehouses (I'm starting to use it for reporting systems).

One great point about MongoDB is that it makes the ETL process a lot easier (you don't have to prepare tables with the right schema and it supports large amounts of data).

I wouldn't be surprised to see some NoSQL solutions get wider adoption in the enterprise, either alone or with tools that build upon them.

As for the article: it's pure linkbait in my opinion!

jaxn · on Sept 30, 2010

I used to be a Business Intelligence consultant for enterprises. We built reports, data warehouses, dashboards, etc. From my experience, the article is spot on, not linkbait.

Maybe MongoDB is better once you have a well defined query that you need, but I think the point of the parent comment is that those examples of queries are ad-hoc. NoSQL is not as good as SQL when it comes to report specs that are constantly in a state of flux.

I need my data available to answer questions. When building a product you have a well defined set of operations based on the features of your product. When the requirements shift on a regular basis, NoSQL is too limiting.

When the article talks about a low-level query language being too limiting, they are talking about missing things like CONNECT BY PRIOR or SUM(CASE WHEN col IN ('a','b','c') THEN 1 ELSE 0 END). These are the same kinds of things that are difficult to do with an ORM.
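For readers who have not met the second construct: it is conditional aggregation, which standard SQL spells `CASE WHEN ... END`. A minimal sketch, using SQLite from the Python standard library rather than Oracle; the `orders` table, its `status` column, and the rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical table and values, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("a",), ("b",), ("d",), ("c",), ("e",), ("a",)],
)

# Conditional aggregation: count the rows whose status is in a given set,
# alongside the total row count, in a single pass over the table.
matching, total = conn.execute(
    """
    SELECT SUM(CASE WHEN status IN ('a', 'b', 'c') THEN 1 ELSE 0 END) AS matching,
           COUNT(*) AS total
    FROM orders
    """
).fetchone()

print(matching, total)
```

Changing the set inside `IN (...)` or the column being tested is a one-line edit, which is the kind of ad-hoc flexibility being described.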

thibaut_barrere · on Oct 1, 2010

I am currently doing reports/data warehouses/dashboards. When something more complicated than simple questions is needed (see Data Warehousing for Cavemen), ad-hoc queries are quite often not the answer anymore, either with NoSQL or with SQL.

I don't want my clients to be dependent on me (or someone else) to build complicated SQL queries when they have questions, so I focus on getting an easy to maintain facts/dimensions model (as advocated by Ralph Kimball http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimens...) which can evolve if needed.

The nice point about MongoDB when doing this is that it makes it a lot easier to add attributes to dimensions, or load the data, or evolve the reporting system in general (and I like that).

You can apply the same principles to build dimensions/facts based data structure and answer questions that SQL alone wouldn't be able to answer easily.

Example of such a question: how many calls did we receive during French legal week #9 that were handled by team X outside normal working hours or while we were on vacation? Of those calls, how many were made by a woman (as it has a financial impact in this case)?
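To make the facts/dimensions idea concrete, here is a rough sketch of how that question could look once the awkward parts (legal week number, vacation flag) are precomputed as dimension attributes. It uses SQLite only because it ships with Python; the point is the model, not the storage engine. Every table name, column name, and row below is a hypothetical stand-in, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date   (date_id INTEGER PRIMARY KEY, legal_week INTEGER, is_vacation INTEGER);
CREATE TABLE dim_team   (team_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_caller (caller_id INTEGER PRIMARY KEY, gender TEXT);
CREATE TABLE fact_call  (
    call_id       INTEGER PRIMARY KEY,
    date_id       INTEGER REFERENCES dim_date(date_id),
    team_id       INTEGER REFERENCES dim_team(team_id),
    caller_id     INTEGER REFERENCES dim_caller(caller_id),
    outside_hours INTEGER
);
INSERT INTO dim_date   VALUES (1, 9, 0), (2, 9, 1), (3, 10, 0);
INSERT INTO dim_team   VALUES (1, 'X'), (2, 'Y');
INSERT INTO dim_caller VALUES (1, 'F'), (2, 'M');
INSERT INTO fact_call (date_id, team_id, caller_id, outside_hours) VALUES
    (1, 1, 1, 1),  -- week 9, team X, woman, outside hours: counted
    (1, 1, 2, 0),  -- week 9, team X, normal hours, no vacation: excluded
    (2, 1, 2, 0),  -- week 9, team X, man, vacation day: counted
    (2, 2, 1, 1),  -- team Y: excluded
    (3, 1, 1, 1),  -- week 10: excluded
    (2, 1, 1, 0);  -- week 9, team X, woman, vacation day: counted
""")

# The question becomes a filter on dimension attributes plus a conditional sum.
calls, from_women = conn.execute(
    """
    SELECT COUNT(*),
           SUM(CASE WHEN c.gender = 'F' THEN 1 ELSE 0 END)
    FROM fact_call f
    JOIN dim_date   d ON d.date_id   = f.date_id
    JOIN dim_team   t ON t.team_id   = f.team_id
    JOIN dim_caller c ON c.caller_id = f.caller_id
    WHERE d.legal_week = 9
      AND t.name = 'X'
      AND (f.outside_hours = 1 OR d.is_vacation = 1)
    """
).fetchone()

print(calls, from_women)
```

The hard work is deciding that `legal_week` and `is_vacation` belong on the date dimension and loading them once; after that the question is short in whatever query language sits on top.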

aloneinkyoto · on Sept 30, 2010

Both of those queries are very easy to perform in MongoDB.

For examples of how to easily model and query trees (CONNECT BY PRIOR in SQL) see: http://www.mongodb.org/display/DOCS/Trees+in+MongoDB

SUM(CASE WHEN col IN ('a','b','c') THEN 1 ELSE 0 END) can be implemented as a group or mapreduce query.
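The shape of such a map/reduce job can be sketched in plain Python. This is not MongoDB's API: `map_fn`, `reduce_fn`, `map_reduce`, and the sample documents are hypothetical names made up here to show the emit-then-sum structure only.

```python
from collections import defaultdict

# Hypothetical documents standing in for a collection.
rows = [{"col": "a"}, {"col": "b"}, {"col": "d"}, {"col": "c"}, {"col": "e"}]
WANTED = {"a", "b", "c"}


def map_fn(doc):
    """Emit one (key, value) pair per document: 1 if col is wanted, else 0."""
    yield ("matching", 1 if doc["col"] in WANTED else 0)


def reduce_fn(key, values):
    """Combine all values emitted under one key by summing them."""
    return sum(values)


def map_reduce(docs, map_fn, reduce_fn):
    """Tiny in-memory driver: group emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


result = map_reduce(rows, map_fn, reduce_fn)
print(result)
```

The map step plays the role of the `CASE` expression and the reduce step plays the role of `SUM`.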

I would say that you have a lot more power and flexibility in MongoDB compared to an average SQL database when it comes to ad-hoc querying.

jaxn · on Sept 30, 2010

I am not a MongoDB expert by any stretch, so please correct me where I am wrong.

The way I read your link it sounds like I need to store the data in a particular way in order to run a parent/child query. That is great if I know that I need that query at design time. What happens if I have tens of millions of records and need to run that report on an ad-hoc basis? Where the relationship may or may not be important?
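The trade-off being asked about can be sketched in plain Python, with dicts standing in for documents (this is not MongoDB's API, and the field names `parent` and `ancestors` are assumptions for illustration). Storing a list of ancestors on each document makes the subtree query a simple filter, but only if that list was written at load time; with parent pointers alone, the same answer needs a level-by-level walk.

```python
# Hypothetical documents; each stores its parent and its full list of ancestors.
docs = [
    {"_id": "root", "parent": None, "ancestors": []},
    {"_id": "a", "parent": "root", "ancestors": ["root"]},
    {"_id": "b", "parent": "root", "ancestors": ["root"]},
    {"_id": "a1", "parent": "a", "ancestors": ["root", "a"]},
    {"_id": "a2", "parent": "a", "ancestors": ["root", "a"]},
]


def descendants(docs, node_id):
    """Subtree lookup using the precomputed ancestor lists (decided at design time)."""
    return [d["_id"] for d in docs if node_id in d["ancestors"]]


def descendants_by_parent(docs, node_id):
    """Same result from parent pointers only, walking the tree one level at a time."""
    children = {}
    for d in docs:
        children.setdefault(d["parent"], []).append(d["_id"])
    found, frontier = [], [node_id]
    while frontier:
        next_level = []
        for n in frontier:
            next_level.extend(children.get(n, []))
        found.extend(next_level)
        frontier = next_level
    return found


print(descendants(docs, "a"))
print(descendants_by_parent(docs, "root"))
```

The second function is roughly what CONNECT BY PRIOR does for you inside the database; without it, someone has to write the walk.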

What if I want to sum the "cost of goods sold" one day and the "items per transaction" the next? Does that not require someone to write code more complex than SQL? Because on Oracle a business analyst can open up Toad and run that query.

If I am wrong then it very well may be that the problem is one of the enterprise not being aware.

aloneinkyoto · on Oct 1, 2010

Ease of use is subjective. I don't think writing a mapreduce job needs to be more complex than writing an equivalent SQL query. What really matters is elegance, flexibility and power.

My personal experience is that the MongoDB model seems to win in most cases, especially when it comes to flexibility and ad-hoc querying. Having a real language (JavaScript) and a flexible schema tends to make most business problems easier to express.

Locke1689 · on Oct 1, 2010

> Ease of use is subjective. I don't think writing a mapreduce job needs to be more complex than writing an equivalent SQL query. What really matters is elegance, flexibility and power.

Unfortunately it seems you have completely misunderstood the nature of both SQL and MapReduce. MapReduce is a distributed computation engine. While it can be used in that way it was never meant to be a database system. BigTable is proof enough of that.

In general, SQL is the syntactical representation of relational algebra with some hacky additions for programmer convenience. Comparing just "SQL" to the MongoDB language model is misguided since you then break down to a question of algebraic expressivity and relational power.

I'm not going to try and build a proof here but we do know that a formal relational algebra system is equivalent to first-order logic. As far as MongoDB's relational language goes, one would probably have to make an argument that it is equivalent to either tuple or domain relational calculus, but I know of no theoretical work that has attempted this. If anyone has any more information on the theoretical expressiveness of the MongoDB relational system I would love to read it.

aloneinkyoto · on Oct 1, 2010

I was not arguing about relational algebra or theoretical expressivity or logical equivalence or anything like that. I was simply stating that in practice most business problems are easier to model and more flexible to query in the MongoDB model.

Of course you need some time to get used to thinking in terms of documents rather than tables and rows. But once you get used to the idea you can easily model most domains that occur in practice.

> MapReduce is a distributed computation engine. While it can be used in that way it was never meant to be a database system. BigTable is proof enough of that.

Yes, MapReduce in the Google and Hadoop sense is designed for massive batch processing. That's why BigTable and HBase exist. MapReduce in the CouchDB and MongoDB sense is a Turing-complete query and processing layer built on top of a document store. In the CouchDB case, MapReduce is the only way you can query the database.

http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views

http://www.mongodb.org/display/DOCS/MapReduce

http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-...

ergo98 · on Oct 1, 2010

>SQL is actually quite often a bad way to try to answer those questions, too! See http://philip.greenspun.com/wtr/data-warehousing.html for an entertaining explanation.

That's why data warehouses rely upon cubes/OLAP for analysis. It is a specialized solution that serves the need very well.

>One great point about MongoDB is that it makes the ETL process a lot easier (you don't have to prepare tables with the right schema and it supports large amounts of data).

So does a CSV. In fact, so does the last silver bullet, which is XML. XML is as loose or as strict as you want it to be.

thibaut_barrere · on Oct 1, 2010

OLAP on SQL comes at a cost, too (which is why some people are diving into analytics and reporting with NoSQL tools, where you can add one server without large expenses).

On CSV/XML, my point wasn't clear enough: I wanted to underline the fact that it's a lot easier to load dimension data, then load fact data and do the foreign key lookups, when working with MongoDB (it's not about the file format, it's about the loading/lookup part, which is a large part of ETL in my cases).
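For anyone unfamiliar with that loading/lookup step, here is a minimal sketch of it in plain Python, independent of any database. The names `raw_calls`, `load_dimension`, and `load_facts`, and the sample rows, are hypothetical; real ETL would also handle missing or late-arriving dimension values.

```python
# Hypothetical raw extract: one row per call, with the team as free text.
raw_calls = [
    {"team": "X", "duration": 120},
    {"team": "Y", "duration": 45},
    {"team": "X", "duration": 300},
]


def load_dimension(raw_rows, attribute):
    """Assign a surrogate key to each distinct attribute value, in first-seen order."""
    dimension = {}
    for row in raw_rows:
        dimension.setdefault(row[attribute], len(dimension) + 1)
    return dimension


def load_facts(raw_rows, team_dim):
    """Replace the free-text team with its surrogate key (the foreign key lookup)."""
    return [
        {"team_id": team_dim[row["team"]], "duration": row["duration"]}
        for row in raw_rows
    ]


team_dim = load_dimension(raw_calls, "team")  # step 1: dimensions first
facts = load_facts(raw_calls, team_dim)       # step 2: facts, resolved against them
print(team_dim, facts)
```

The order matters: facts can only be resolved once the dimension keys exist, which is why this step tends to dominate ETL work.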

senya72 · on Oct 2, 2010

how about high-perf joins at run-time?

ergo98 · on Oct 1, 2010

For the people who need OLAP analysis, the relevant expense range is seldom that much of a consideration. I'm looking at a storage system right now that costs $800,000. It's considered mid-range and is merely the starter system.

Note that OLAP is, in many regards, NoSQL. It is really the most successful variant of NoSQL.

However I'm very curious what sort of analytics people are doing with NoSQL. I have seen people essentially generating reports to MongoDB, for instance, but I have never seen anything remotely approaching flexible analytics on such a system.

gaius · on Oct 1, 2010

That article is very peculiar:

> A data warehouse is a separate RDBMS installation that contains copies of data from on-line systems. A data warehouse would not be necessary if RDBMS software worked as advertised. It is merely a $10 million bandaid applied to the limitations of modern computers and RDBMS software.

If you want to do more computation, you require more computers, and this is "insight"?

You might as well say "A level 2 cache would not be necessary if RAM worked as advertised". Duh!

ratsbane · on Oct 1, 2010

Having spent an unpleasantly large part of my life enmeshed in it, I strongly disagree with the last statement "with really good application software on top of it."

Otherwise, spot-on.