So up front: I know this question sounds like classic flamebait. I'm asking it sincerely, though, and its similarity to flamebait means I have yet to see an answer that I am satisfied with, as the discussions degenerate quickly. Anyway:
What is non-relational data? I have been under the impression for a long time that datasets all have internal relationships, be they semantic or other, and that relational databases are systems to store and query data based on the relationships. Even a key value store defines an explicit relationship.
When I first learned about databases, the class was about Normal Forms and Relational algebra, with RDBMSs thrown in as an example. These two key concepts are methods of working with data in a subtle and nuanced way -- a way strongly based on set and/or category theory. SQL always seemed to me to be a decent extension of this, a DSL specifically designed to work on sets vs individual pieces of data. In that I found it elegant. Now this doesn't mean SQL is perfect, I get that; in a lot of cases it is clunky or worse. Nor does it mean that the underlying datastore needs to model data to exactly follow the relationships that describe it. And in light of CAP vs ACID views on data, a lot of exploration of this space makes sense. None of this includes a concept of relational vs non-relational data. In fact I view these systems as potential new building blocks of relational systems.
So this has turned out longer than I originally thought, but back to the main question: what is non-relational data?
I think it hinges on your definition of relational and where you want to cut off "what's reasonable."
Everything in a computer's memory is related by existing in a uniform address space.
Key-value relationships can be ordered implicitly or explicitly, either by describing an algorithm to sort the keys, or an algorithm that creates a new indexing by comparing the values.
Caching data creates a relationship with manually defined logic.
I think the right question is "how formalized are the relationships?" and a related one is "how reflective is the data?" Programmers have plenty of ways to access data both formally and informally - the informal methods are often closer to the machine, but remove context that would help to automate and verify the process. You can also have data that doesn't say much about its context, and data that is very heavily cross-referenced.
NoSQL, to me, seems to be mostly about relaxing the heavy formalization of SQL - to shortcut to simpler or even incomplete descriptions of data, and deal with the resulting fallout later, in the same way that dynamic languages eschew static checking. Good if your needs are simple, potentially dangerous if your data becomes complex.
From what I understand, some proponents of NoSQL are even more minimalist than that--they are fine with the idea of relations, just not with storing them in a traditional RDBMS (e.g. MySQL, Postgres, Oracle, etc). The most convincing argument that I see for this is traditional RDBMSs are all row-oriented and organized via relational theory, and that a lot of data is better stored and/or manipulated via column-oriented, or graph theory, or object oriented systems.
The row orientation of a sql database is a significant limitation for sufficiently deep datastructures IMO. They often force you into multiple calls for a single entity or complicated loops through really long rows. All of which can be the cause of wasted cycles in code. A sql db with its rows presents a poor datamodel for those cases.
You are conflating SQL with row orientation, even though column databases frequently use SQL for queries. This is essentially why I originally asked -- SQL and relational data do not require what you describe; it has just been the traditional approach to implementing them.
I think when zaphar said "sql database" he was just using common usage for RDBMS. Yes, "SQL database" does not have to mean "row oriented storage". But all of the commonly familiar SQL RDBMSs use row oriented storage.
I actually think SQL the language assumes row oriented storage. I honestly can't think of a single SQL database that isn't row oriented. While theoretically you could, in practice no one has.
I can't say for certain but this might be related to the reliance on sql for the query language.
From what I understand, some proponents of NoSQL are even more minimalist than that--they are fine with the idea of relations, just not with storing them in a traditional RDBMS
I don't know about this. I would never use the word NoSQL... but I use object databases heavily and find them a better fit than relational databases for the majority of my work. However, I still store the physical data in a relational database. (Key => value.)
This is beneficial in a number of ways. You already know how to administer MySQL, you already know how to replicate MySQL, you already know how to back up MySQL, etc. This method also affords you the opportunity to reuse the database machinery in your application. You can extract attributes from objects as you store them, and later index/search them with the usual database querying infrastructure. The database does not know everything about the objects you are storing, it just knows a few key facts that you know you want to search on, so you can write efficient searches, but you are not limited by the usual relational weaknesses. (The "object-ness" is stored in a structure opaque to the database engine, like a JSON blob or something.)
As an example, consider cleaning out dead user objects that have not confirmed their email address after a certain number of days. If you were only using an object database, you would maintain a set of users to potentially expire. When they confirm their email address, you remove them from this set. Fine. The problem is the date constraint; you don't really want to scan the entire set of potentially-expireable users to expire users, you would like to be able to search for these objects efficiently.
That's easy to fix; when you store a user object, you can extract the confirmation status and registration date from the object and store those in real columns next to the opaque object data. Then you can "SELECT object_id FROM objects WHERE registration_date < NOW() - INTERVAL 30 DAY AND confirmed_email IS NULL" and remove those objects from your system. Efficient, and it works as you change the structure of the user object, or add subclasses, etc.
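A minimal sketch of that layout, using Python's built-in sqlite3 rather than MySQL so it runs anywhere. The table and column names follow the query above; the helper names (make_store, store_user, expired_ids) and the choice of ISO 8601 strings for dates are assumptions made for illustration, not anything from a real object database.

```python
import json
import sqlite3
from datetime import datetime, timedelta


def make_store():
    """Create an in-memory store: a few real columns plus an opaque blob."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE objects (
               object_id INTEGER PRIMARY KEY,
               registration_date TEXT NOT NULL,  -- extracted, ISO 8601 string
               confirmed_email TEXT,             -- extracted, NULL until confirmed
               data TEXT NOT NULL                -- the object itself, opaque to the engine
           )"""
    )
    # Index only the facts we know we want to search on.
    conn.execute(
        "CREATE INDEX idx_expiry ON objects (confirmed_email, registration_date)"
    )
    return conn


def store_user(conn, user):
    """Serialize the whole object and copy two attributes into real columns."""
    cur = conn.execute(
        "INSERT INTO objects (registration_date, confirmed_email, data) "
        "VALUES (?, ?, ?)",
        (user["registered"], user.get("confirmed_email"), json.dumps(user)),
    )
    return cur.lastrowid


def expired_ids(conn, now, days=30):
    """Return ids of unconfirmed users registered more than `days` ago."""
    cutoff = (now - timedelta(days=days)).isoformat()
    rows = conn.execute(
        "SELECT object_id FROM objects "
        "WHERE registration_date < ? AND confirmed_email IS NULL "
        "ORDER BY object_id",
        (cutoff,),
    )
    return [row[0] for row in rows]
```

The blob can grow new fields or represent subclasses without a schema change; only the two extracted columns are visible to the query engine.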
I work with objectstore nearly every day and wish we were using Oracle or DB2. I am firmly convinced that you better have a damn good reason (like a big heavily updated graph structure) to forgo the discipline of a normalized relational schema.
Use objects in memory. I'm just saying the relational <-> object mapping is a necessary evil because in the long run with complex data you'll suffer for your sin of skipping normalization.
Relational theory only requires row orientation at a conceptual level, and not even that -- a row is just a set with a key and potentially a relationship with another set. At a lower level than the relational algebra, much data is better kept in columns or graphs, as you mention. This is why I ask my question: the data is conceptually relational (all data is pretty much relational by definition...) but access patterns dictate a non-row orientation.
Well, I just think that a lot of data, most of it in practice, does not naturally fall into a table structure.
Social networks, hierarchies of inheritance, arbitrary relations between documents such as the web itself - most data is naturally a graph or tree structure.
So when you squeeze these kinds of data into tables, and do the right thing and normalize so you don't have duplicates etc., then you have a lot of SQL manipulations to get the data back out.
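To make the "SQL manipulations" concrete, here is a rough sketch of pulling a subtree back out of a normalized edge table. It uses Python's sqlite3 and a recursive common table expression, which is a feature of SQLite and several other engines, not something mentioned in this thread. The edges table and the make_graph and descendants helpers are made up for the example.

```python
import sqlite3


def make_graph(pairs):
    """Store a graph as a normalized table of (parent, child) edges."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (parent TEXT NOT NULL, child TEXT NOT NULL)")
    conn.executemany("INSERT INTO edges (parent, child) VALUES (?, ?)", pairs)
    return conn


def descendants(conn, root):
    """Return every node reachable from `root`, sorted by name."""
    rows = conn.execute(
        """WITH RECURSIVE sub(id) AS (
               SELECT child FROM edges WHERE parent = ?
               UNION  -- UNION (not UNION ALL) drops repeats, so cycles terminate
               SELECT e.child FROM edges AS e JOIN sub AS s ON e.parent = s.id
           )
           SELECT id FROM sub ORDER BY id""",
        (root,),
    )
    return [row[0] for row in rows]
```

With edges a->b, a->c, b->d and d->e, descendants(conn, "a") returns ['b', 'c', 'd', 'e']. In a graph or object store the same traversal is usually just following references, which is the contrast being drawn here.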
Relational, I think, has a specific meaning in terms of relational algebra, the math behind normalising tabular data in 'Relational Databases' (tabular databases).
But obviously you need relations between items of data regardless of whether they are in a RDB, OODBMS or in a graph or tree structure.
I think the Key Value stores we see being so popular now have got it right, but there's some part missing - a good intuitive powerful query and update language [probably functional, definitely not SQL].
I'd put it in the same category as RDF - largely unusable.
I think data is much wider than 'what can be stored in tables' - that's the whole problem.
Although all data could be stored in tables and accessed via SQL, it shouldn't be. I contend we need to optimize for the common case of tree / graph data and find something much better than SQL.
The relational model has a formal definition given by E.F. Codd that is founded in mathematical theory, more information about which can as always be found on Wikipedia.
"a relation is a data structure which consists of a heading and an unordered set of tuples which share the same type."
So consider something that does not meet that definition: data which you place under one heading, but do not share the same type. This certainly sounds like a document database, or a key-value store - where "value" is implicitly of an undefined type.
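As a rough illustration of that definition, the check below treats a heading as a mapping from attribute name to type and accepts a collection of rows only if every row has exactly those attributes with exactly those types and no tuple appears twice. This is one reading of the quoted sentence, not Codd's formal definition, and is_relation is a made-up name. It also assumes attribute values are hashable.

```python
def is_relation(heading, rows):
    """Check rows against a heading of {attribute name: type}.

    Rows are dicts. Order is ignored, and duplicate tuples are rejected
    because the body of a relation is a set.
    """
    seen = set()
    for row in rows:
        # Every tuple must have exactly the attributes named in the heading.
        if set(row) != set(heading):
            return False
        # Every attribute value must be of the declared type.
        if not all(type(row[name]) is typ for name, typ in heading.items()):
            return False
        key = tuple(row[name] for name in sorted(heading))
        if key in seen:
            return False
        seen.add(key)
    return True


heading = {"key": str, "value": str}
typed_rows = [{"key": "a", "value": "x"}, {"key": "b", "value": "y"}]
document_rows = [{"key": "a", "value": "x"}, {"key": "b", "value": {"nested": 1}}]
```

Here is_relation(heading, typed_rows) is True, while is_relation(heading, document_rows) is False: the second "value" is a nested document, so the tuples no longer share one type.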
Interestingly but perhaps less relevant, an ordered set should not be considered relational, which is why for example you should not consider that adding a clustered index on a table will guarantee the order in which rows are returned by a query even though this may occur as a side effect of the underlying implementation.
Note also that most RDBMSs are - perhaps fortunately - fairly lax about limiting themselves to theory; see ORDER BY in your SQL implementation of choice.
There is certainly data you don't want to put into a relational model.
There are domains where it is essentially impossible to create a reasonable relational model. For example where a class references many, quite different, classes. In the relational world you need many kinds of different keys to many other tables and the table names. And even then you need to bounce out of the relational world to get the results. In the object world references are uniform.
There are a lot more cases where a reasonable relational model exists but it is not supported by the mature SQL databases. A very large matrix, a highly structured document, source code etc.
And there are even more cases where it will no longer be worthwhile to create an SQL model if the NoSQL databases mature, such as when an object, JSON, or XML model needs to exist. I think this is close to 100% of enterprise applications.
Representing polymorphism in a relational database always involves a hack. Computer programs operate on polymorphic data structures, so it is often annoying that the database cannot store this cleanly.
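One common form of the hack is a single table with a discriminator column and a nullable column for every field of every subclass. The sketch below shows it with Python's sqlite3; the Circle and Rect classes, the shapes table, and the save and load helpers are all invented for the example.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Circle:
    radius: float


@dataclass
class Rect:
    width: float
    height: float


def make_db():
    conn = sqlite3.connect(":memory:")
    # One table for all subclasses: `kind` says which columns are meaningful.
    conn.execute(
        """CREATE TABLE shapes (
               id INTEGER PRIMARY KEY,
               kind TEXT NOT NULL,
               radius REAL,   -- only set for circles
               width REAL,    -- only set for rects
               height REAL    -- only set for rects
           )"""
    )
    return conn


def save(conn, shape):
    """Flatten the object into the shared row layout."""
    if isinstance(shape, Circle):
        row = ("circle", shape.radius, None, None)
    elif isinstance(shape, Rect):
        row = ("rect", None, shape.width, shape.height)
    else:
        raise TypeError(f"unsupported shape: {shape!r}")
    cur = conn.execute(
        "INSERT INTO shapes (kind, radius, width, height) VALUES (?, ?, ?, ?)", row
    )
    return cur.lastrowid


def load(conn, shape_id):
    """Rebuild the right subclass by switching on the discriminator."""
    row = conn.execute(
        "SELECT kind, radius, width, height FROM shapes WHERE id = ?", (shape_id,)
    ).fetchone()
    if row is None:
        raise KeyError(shape_id)
    kind, radius, width, height = row
    if kind == "circle":
        return Circle(radius)
    if kind == "rect":
        return Rect(width, height)
    raise ValueError(f"unknown kind: {kind}")
```

Every new subclass means another branch in save and load and more nullable columns, and the schema itself cannot say that a circle must have a radius. That is the sense in which it is a hack.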
Great post, it is important to make clear that NoSQL is not just about performance. For instance, now that I'm used to Redis lists and sets, it feels strange to use an SQL DB for a problem where just pushing or popping stuff is the trivial thing to do.
Of course SQL databases are very useful and a great tool for many domains. It is also not a balanced view to think that SQL databases should be thrown away. Sometimes the table-based data model, with the querying power of SQL, is just the way to go for many kinds of problems, or in addition to a different kind of DB.
The term NoSQL is catchy but wrong. Some databases with strong theory backing them up fall under the NoSQL umbrella, and so do others which are ad hoc and a huge step backwards.
I think the ultimate lesson we've re-learned is to use graph, hierarchical, and relational databases in an appropriate manner and to make engineering tradeoffs around consistency as needed. NoSQL is a crappy name for this lesson.
While the "NoSQL" moniker is catchy, misconceptions are inevitable so long as this category is defined more by what it isn't rather than what it actually is. What is NoSQL all about aside from scalability, performance, key/value access or support for non-relational data? This article doesn't clearly say.