The Death of the Relational Database

bumbledraven · on March 26, 2008

The author is misinformed in many ways. To pick an easy example, he says that "the relationship between objects is built into the objects", and as an example cites, "An invoice knows as part of its structure, who the customer is. That pointer to the customer is stored in the invoice."

But that doesn't have to be the case at all. It's common (in general, maybe not with customers and invoices) to store the relationship in a third table, say "customerinvoice", that has customerid and invoiceid fields.

stcredzero · on March 26, 2008

I agree the author is misinformed, though I also agree with him that relational databases suck, especially if you are doing object oriented programming. I'm not sure what is better, however. Also, most of the corporate world is firmly ensconced in the relational database mindset. This makes interacting with them difficult if you do not also "speak relational."

michaelneale · on March 26, 2008

Maybe it's object-oriented dogma that is a problem.

KiwiNige · on March 26, 2008

I've worked with a lot of accounting data, which fits the relational DB model really well. But I can't understand why I have to use an OO model to display it on a screen as a table of rows and columns with the odd field scattered here and there. It just seems like a lot of extra complexity to me when all the users want to see is SELECT * FROM TRANSACTION.... made to look pretty.

apathy · on March 27, 2008

Funny, that's what my users want, and the only object I use to give it to 'em is a CSV parser that happens to have an output method (which works really well).

Users are delighted that all they have to do is click and their Excel/SAS datasets fill up. They are dumbfounded when I show them the code and it is quite literally 4 lines long, and two of them are MIME headers.

Of course, they don't see the definition of the VIEW, but why should they? ;-)

michaelneale · on March 27, 2008

Indeed. OO is so ill defined anyway. What is documented as OO now I am sure is not what Alan Kay intended it to be. New popular frameworks like Rails don't help either (much) - although it's OO-lite - so it's not so bad.

Still, I am looking forward to an excuse to crack out stuff like Arc and walk away from OO for a while (I am allowed to dream).

stcredzero · on March 27, 2008

I've been working with Smalltalk, which was Alan Kay's creation. Most implementations are not what he intended -- at one Smalltalk Solutions keynote he excoriated all of us. He said that he never intended Smalltalk to become a programming language. He wanted to create a Montessori toy for the mind. That said, I don't think that Smalltalk as a programming language is that far from what he intended. He just never intended it to stop there.

darose · on March 27, 2008

Doubtful. Object oriented models mirror the real world far more than normalized relational data.

So then ... maybe it's relational dogma that is the problem. People are realizing that it is the source of the object-relational impedance mismatch, and is what needs to change.

gruseom · on March 28, 2008

> Object oriented models mirror the real world

But that's only useful to the extent that programming is modeling something in the real world. Much computation doesn't.

Kaizyn · on March 27, 2008

This is because businesses learned painfully over many years that the relational model is the simplest way for them to store the data relevant to what they do in a way that is accessible to all of their IT systems that they have or that they might want to build. Relational databases, as they are currently implemented, might be getting long in the tooth now: http://lambda-the-ultimate.org/node/2500

However, given the relational model's foundation in set theory and predicate logic, it seems you'd be hard pressed to replace it with something else that can offer the same data integrity guarantees.

gibsonf1 · on March 27, 2008

If you do that (create a customerinvoice rel table) you still have to change your data model every time you want to track a different kind of relationship. We've just converted to a generic relationship table (subject relation object) to handle any kind of relationship, so we no longer need to keep adding tables and changing data models. This is beyond a huge time saver.
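A minimal sketch of the generic (subject, relation, object) table described above, assuming SQLite through Python's `sqlite3` module. The table name `relationship`, the string identifiers such as `customer:1`, and the helpers `relate` and `objects_of` are hypothetical and chosen only for illustration; the comment does not say how the real system stores its keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table holds every kind of relationship as a (subject, relation, object) row.
conn.execute(
    "CREATE TABLE relationship ("
    " subject TEXT NOT NULL,"
    " relation TEXT NOT NULL,"
    " object TEXT NOT NULL,"
    " PRIMARY KEY (subject, relation, object))"
)


def relate(subject, relation, obj):
    """Record a relationship; a new kind of relation needs no schema change."""
    conn.execute("INSERT INTO relationship VALUES (?, ?, ?)", (subject, relation, obj))


def objects_of(subject, relation):
    """Return every object linked to `subject` by `relation`, sorted."""
    rows = conn.execute(
        "SELECT object FROM relationship"
        " WHERE subject = ? AND relation = ? ORDER BY object",
        (subject, relation),
    )
    return [row[0] for row in rows]


relate("customer:1", "has_invoice", "invoice:10")
relate("customer:1", "has_invoice", "invoice:11")
# A relation nobody planned for is just another row.
relate("customer:1", "favorite_restaurant", "restaurant:7")

print(objects_of("customer:1", "has_invoice"))  # ['invoice:10', 'invoice:11']
```

Note the trade-off in this sketch: because the identifiers are plain strings, the database cannot enforce that `invoice:10` actually exists, so that check would have to live in application code.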

darose · on March 27, 2008

OMG!!!! I think there was a "Daily WTF" about this design!!! LOL!!!

greendestiny · on March 26, 2008

I think everyone has considered very loose table schemas before, built entirely out of foreign key relationships; mostly it's not done because of performance. I do wonder whether a database engine built specifically around this paradigm would be fast enough.

edw519 · on March 26, 2008

The author is confusing the term "relational databases" with the implementations of relational database systems that he has encountered.

That's like looking at a bad Python program and saying, "Python sucks," or like saying, "I've never seen a car go more than 10 miles without breaking down; therefore cars are not reliable transportation."

You can store just about anything in a RDBMS pretty much any way you want. You're limited only by your own skill and imagination and the particular limitations of your vendor's implementation.

A better title would have been, "Here We Go Again: The Death of the Relational Database Prematurely Announced."

michaelneale · on March 26, 2008

Time for the obligatory: "rumours of RDBMS death are greatly exaggerated"

davidmathers · on March 26, 2008

Here's the comment I left on Hank's blog:

Hi Hank, I like to be brief so this might sound bad, but I mean everything in the friendliest possible way.

You and almost all the people arguing against you are wrong about almost everything. I don't mean to say anyone's opinion is wrong. I'm talking about basic understanding of what certain words mean.

Let me break it down. There are 3 models for "programming" (in the general sense) computers:

1. Functional
2. Relational
3. Imperative

The functional model can't store data and therefore can't be used to create a database. So there are fundamentally only 2 kinds of databases.

A database created using the Relational model uses relations to both store and retrieve data, so let's call it a relational database. A database created using the Imperative model uses pointers & nodes to store data and pointer navigation to retrieve data, so let's call it a navigational database.

That's it. There are only 2 database models. Each model can be used to implement different kinds of databases based on the limits they place on the structure.

There are 2 primary kinds of navigational database: graph/network and tree/hierarchy. A filesystem for example is a tree/hierarchy database.

A relation is basically a truth table with columns that are related to each other by a truth statement and rows of truth values that fulfill the truth of the statement. A standard relational database doesn't place any limits on the number of rows or columns. A binary database limits the number of columns to 2.

A SQL DBMS is a (partially successful) attempt to implement a language that can be used to create a database which uses the relational model.

OK, the important parts:

1. The semantic web is an implementation of the relational model that limits the relations to 3 columns and a single row.

2. Just because you use a SQL DBMS to create a database doesn't mean you actually created a relational database. You can put pointers in your tables, turning your relations into nodes on a graph, turning your database into a navigational database with some relational features.

Much of what you said in your original post was exactly backwards. You said "relational sucks" but the things you described as problems were features of navigational databases, not relational. Then you said "the semantic web is awesome because it's a navigational database" when in fact it's a relational database.

That's all.

jules · on March 26, 2008

> Let me break it down. There are 3 models for "programming" (in the general sense) computers:
>
> 1. Functional
> 2. Relational
> 3. Imperative

Could you explain why you picked these three models?

Do you mean relational in the sense that you don't have input-output functions, but relations like:

    plus(4,4,x) => x = 8
    plus(y,4,6) => y = 2

Why can you store data with a relational model but not with a functional model?

davidmathers · on March 26, 2008

Functional and relational are 2 sides of the same coin. For instance:

x + y = z

Can be viewed as either a function called binary_addition with x & y as inputs and z as the output, or as the description of a relationship between 3 sets.

So, to solve the problem using SQL for example you would create a table binary_addition(x,y,z), fill it with all the values that make x + y = z true, and then say "select z from binary_addition where x = 4 and y = 4".

In the functional model the computer stores the process for turning the input values you give it into the output values you want. In the relational model the computer stores a table containing all known possible input values, all known possible output values, and how they relate to each other, and gives you a way to retrieve them.
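A small sketch of the binary_addition table, assuming SQLite through Python's `sqlite3` module and a table filled only for operands 0 through 9 (a real relation over all integers could not be materialized this way). The helper `solve` is hypothetical. It shows the point made above: the same stored relation answers both "what is 4 + 4?" and "what plus 4 gives 6?".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE binary_addition (x INTEGER, y INTEGER, z INTEGER)")

# Store every (x, y, z) row for which x + y = z holds, for small operands only.
conn.executemany(
    "INSERT INTO binary_addition VALUES (?, ?, ?)",
    [(x, y, x + y) for x in range(10) for y in range(10)],
)

COLUMNS = ("x", "y", "z")


def solve(unknown, **known):
    """Look up the values of one column given fixed values for the others."""
    if unknown not in COLUMNS or not known or any(c not in COLUMNS for c in known):
        raise ValueError("columns must be drawn from x, y, z")
    where = " AND ".join(f"{col} = ?" for col in known)
    query = (
        f"SELECT {unknown} FROM binary_addition WHERE {where} ORDER BY {unknown}"
    )
    return [row[0] for row in conn.execute(query, tuple(known.values()))]


print(solve("z", x=4, y=4))  # [8]  -> used like a function of x and y
print(solve("x", y=4, z=6))  # [2]  -> the same relation read "backwards"
```

Nothing here computes an addition at query time; the answers come entirely from rows that were stored in advance, which is the contrast with the functional model drawn in the comment.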

Note, the functional model is implemented by what that famous guy whose name I can't remember called "function-level" programming languages, not functional (aka lambda) languages.

davidmathers · on March 26, 2008

The famous guy is named Robin Milner. He created the ML language. In lambda languages like Lisp or JavaScript you can use functions as arguments and return values. In function-level, or applicative, languages you can only create new functions by combining existing functions. I think. I've never used a function-level language.

pius · on March 26, 2008

> The functional model can't store data and therefore can't be used to create a database.

That's not quite right. You can persist data in functions, it's just not done as often.

giardini · on March 27, 2008

The article is worthless: not a single sentence of the first 5 paragraphs of the blog post makes sense when examined critically. The author does not understand relational databases nor how flexible they are. In fact I'm fairly certain he understands neither OOP, nor RDBMS, nor the Semantic Web (of which I am no proponent) well if at all.

Every now and then a developer community bubbles over with complaints about RDBMS and gets some attention. Most support is from people who, like the author, understand OOP to a certain degree but don't understand RDBMS.

And time after time predictions of the death of the relational database model prove wrong: RDBMS usage only increases. The relational database model supplanted the network database model (which corresponds to the "graph databases" the author speaks of) for good reasons.

Nothing to see here: keep moving folks.

vixen99 · on March 27, 2008

Am I alone in finding your comment unnecessarily unpleasant? Instead of offering unilluminating pejorative rhetoric, why don't you provide even one example of why you take issue with the article? Perhaps you're right about it but you give no reason for supposing this to be so. Also, how about letting the well-worn 'move along now' cliche enjoy a well-deserved rest?

giardini · on March 27, 2008

Perhaps not.

No, I believe it makes sense to draw a line. When a person (indeed, even a specialist) claims special insight (including critical insight) of a well-examined problem he is almost always wrong. The physicist John Baez not infrequently encounters cranks who believe they've found errors in relativity theory or quantum mechanics. He has developed a scale for rating cranks:

http://groups.google.ca/group/sci.physics/msg/5312a801e0785e...

The OP is wrong in so many ways that it renders his article meaningless. And others have pointed out possible errors (although doing so one must interpret the OP's intentions, a risky endeavor indeed), though certainly not to exhaustion. To add a single specific item of criticism to the fray would only provide yet another handle for the OP or other misled persons to grasp and extend the discussion uselessly.

The human mind can create ideas, phrases, and analogies some of which, upon further examination, are devoid of meaning. Dreaming is an extreme instance wherein most of the ideas later make no sense. However the same thing can happen while fully conscious and is part of the normal creative process.

Mathematics and logic are tools we use for separating empty ideas from useful and meaningful ones. Unfortunately there is no Royal Road to mathematics or logic, nor to relational databases:

http://en.wikipedia.org/wiki/Royal_Road

I have neither the time, nor the inclination, much less the rhetorical skill to enlighten the OP or this group as to the vagaries of databases.

Nor do I view this as a "rhetorical" discussion: rhetoric is concerned with swaying the populace to your side of the argument whether you are correct or not. I am concerned about what is correct rather than what is popular.

I do not doubt the enthusiasm (or frustration) of the OP; however, his complaints are poorly stated, unclear and originate from an incomplete understanding of logic and relational databases. Many similar complaints have been stated before (often much more clearly and in a form arguable) in more appropriate venues (e.g., Google for "relational vs OOP group:comp.*"), where they have been thrashed about thoroughly by better, and worse, men than me.

It is one thing to register frustration. But it is another to casually question ideas that have withstood the test of time and cast that questioning as serious.

To show that frustration in the development of databases is nothing new see William Kent's "Data and Reality":

http://www.amazon.com/Data-Reality-William-Kent/dp/158500970...

tx · on March 27, 2008

I am surprised by so many "author is confused" and "author is misinformed" responses. Can't you guys operate on a higher level of abstraction? Aren't we all here dynamic language lovers?

Just listen to what he says more closely, because he is essentially suggesting that strong typing is bad for databases for the exact same reasons it's bad for programming languages. Yes, it makes things faster, more efficient, robust but ... (surprise!) less flexible and dumb.

"Duck-typed storagebases" are indeed the future and perhaps the number of negative reactions is the best indicator of how novel the idea is.

I am in disagreement with "semantic web" movement (in my opinion it's already semantic enough), but the storage part is spot on.

sant0sk1 · on March 26, 2008

"Now, along comes the semantic web just in time to make us all feel really dumb again."

This happens to me almost daily, and I love every minute of it...

BrandonM · on March 26, 2008

For example, imagine starting out with a contact list. Some months later, you add a restaurants list. Some months later again, you decide it would be great to be able to capture, for each contact, what their favorite restaurants are. Ideally one would want to just establish a “favorite” relationship between a restaurant and a contact without changing the restaurant structure or the contact structure.

Let's look at this example in a relational database. Personally, I've never even implemented a database, but I did take one class on relational databases. The way I learned it, the contacts would be one table and the restaurants would be another table. A third table, let's call it "FavoriteRestaurant" would have two columns [1]: a foreign key to an entry in the Contacts table and a foreign key to an entry in the Restaurants table. The primary key in this table would have to be the contacts column, since restaurants would appear more than once. If each person can appear in the FavoriteRestaurant table more than once (multiple favorite restaurants), then both columns would have to serve as the primary key.

Thus, we have managed to effectively utilize a relational database to express a new relationship, without ever changing the original data. The author said:

> Most relational databases actually have an upper limit on the types of objects, typically referred to as tables, which can be handled. Too many tables in a database schema is considered bad design.

If that is indeed the case, that is where the problem lies. I am far from being a champion of relational databases, but it seems to me like a lot of people don't think critically about how best to store the information in their databases. More tables, in my mind, is a good thing.

[1] The number of columns would actually be equivalent to the sum of the number of items in the primary keys of both the Contacts and Restaurants tables. With unique identifiers (SSNs and vendor IDs, for example), of course, this would indeed be two columns.
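As a rough sketch of the FavoriteRestaurant table described above, assuming SQLite through Python's `sqlite3` module: the column names, the sample rows, and the helper `favorites` are made up for illustration. The composite primary key covers the case where a contact has several favorite restaurants, and neither the Contacts nor the Restaurants table is altered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

conn.executescript("""
CREATE TABLE Contacts (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Restaurants (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Added later: the new relationship lives in its own table.
CREATE TABLE FavoriteRestaurant (
    contact_id    INTEGER NOT NULL REFERENCES Contacts(id),
    restaurant_id INTEGER NOT NULL REFERENCES Restaurants(id),
    PRIMARY KEY (contact_id, restaurant_id)
);

INSERT INTO Contacts VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO Restaurants VALUES (1, 'Thai Palace'), (2, 'Pizza Corner');
INSERT INTO FavoriteRestaurant VALUES (1, 1), (1, 2), (2, 2);
""")


def favorites(contact_name):
    """Return the names of a contact's favorite restaurants, sorted."""
    rows = conn.execute(
        "SELECT r.name FROM FavoriteRestaurant AS f"
        " JOIN Contacts AS c ON c.id = f.contact_id"
        " JOIN Restaurants AS r ON r.id = f.restaurant_id"
        " WHERE c.name = ? ORDER BY r.name",
        (contact_name,),
    )
    return [row[0] for row in rows]


print(favorites("Alice"))  # ['Pizza Corner', 'Thai Palace']
```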

pius · on March 26, 2008

> A third table, let's call it "FavoriteRestaurant" would have two columns [1]: a foreign key to an entry in the Contacts table and a foreign key to an entry in the Restaurants table.

Yup, you could definitely express a new relationship that way. I think the point is that adding join tables like that has traditionally been considered an anti-pattern for relational databases because it increases duplication and denormalizes the data, thus working against the supposed performance gains and data integrity protection from using the RDBMS in the first place. If this is wrong, please do correct me.

One major difference I've noticed between document/graph-oriented databases and relational ones is that they embrace denormalization and even optimize for it insofar as that's possible.

LogicHoleFlaw · on March 26, 2008

> adding join tables like that has traditionally been considered an anti-pattern for relational databases because it increases duplication and denormalizes the data

I thought that join tables express a normalization of the data? You are then not storing restaurant data explicitly as a column in the Contact table, which reduces data duplication and gives you more fine-grained control over your structure.

Join tables (especially reflexive ones) gave me a bit of a headache when I first started working with SQL databases. Once I finally wrapped my head around them I started seeing a lot of uses for and advantages of them. However, I've had little formal training in database techniques; only a little bit of relational algebra. Is there something I'm missing here?

pius · on March 26, 2008

That sounds very plausible and, indeed, I have join tables throughout my apps. This is why I put the proviso in, "correct me if I'm wrong." :)

My understanding was that hardcore relational database guys would say that join operations are necessary when the data's totally denormalized, but having a join table wasn't necessarily a best practice because now you've got an additional table that could potentially get out of sync.

edw519 · on March 26, 2008

"If this is wrong, please do correct me."

It's not right or wrong. It's a trade-off, a design decision.

Just because a tool can do something, doesn't mean it should in every case.

Normalize as far as it makes sense for your app.

pius · on March 26, 2008

The "please do correct me" part was in regards to my characterization of the arguments that RDBMS gurus make against join tables. I wasn't asking you to judge my architectural decisions, thanks. ;)

I think it's beside the point whether or not an RDBMS can handle a given app's data; it almost certainly can. The real question is should people dogmatically choose an RDBMS for every single data persistence problem they need to solve.

While the article's title is obviously hyperbole, I think the dissent against choosing the RDBMS model of storing knowledge is a good one. I see the decision that system architects are faced with here as being an end-to-end argument: should the protections and optimizations provided by relational databases be enforced at such a low level or are dumb databases that delegate those features to other layers better design? There's decent evidence for the latter.

edw519 · on March 26, 2008

What you call "hyperbole", I call flamebait.

"Choosing the right tool for the job" does not equal "the old tool is dead".

pius · on March 26, 2008

"Choosing the right tool for the job" does not equal "the old tool is dead".

Can't argue with that.

demallien · on March 27, 2008

Exactly. It's a bit like Blub in the database space. Sure, you can do anything in a relational database, but there are certain problems that are better handled other ways.

Relational databases have been optimised to do a certain task very well: retrieve information very rapidly from a large dataset, that has well-defined, relatively static, relationships between entities. They were designed for things such as storing government census data, tracking customer details for very large corporations, insurance data etc etc. You may like to think of relational databases as being the C equivalent in the database space: relatively low-level, fast.

But there is a new type of database out there. It is more human in its scale, with tables that might only have thousands, rather than millions, of entries. But the data has very fluid relationships between entities: Hank's example in the comments here about wanting to track a contact's favourite restaurant is a classic example. We might also want to track the contact's favourite film, their car model, and where they last went for their holidays. A bit later we may also consider it really important to know what brand of dishwashing detergent they use.

This sort of problem is not well handled by relational databases. They aren't optimised for it. They are optimised for speed over large datasets with static relationships. To get fluid relationships, you need to use join tables, which have a large size cost, and which degrade the reliability of the database. But speed is not typically a problem for these types of problems. There are lots and lots of small tables, instead of a few large tables. Manually examining every item in the table to find one that matches your criteria is not necessarily a prohibitively expensive operation.

Thinking about this kind of problem from a Rails perspective, for example, you might decide to modify ActiveRecord such that its has_one, belongs_to style declarators, instead of working on static fields in a table, attack one table for the object, containing a bunch of tuples that look like {"relationship_name", table_id, [item_id1, item_id2,...]}. Operations on the relationships between objects modify this table, rather than the actual tables that contain object information. As such, the structure of links between tables becomes just another table. It wouldn't be as fast as a relational database, but for this type of task, it is far better suited, as it would be far more expressive - a new relationship between objects can be created just by inserting a new row into the table, and is hence modifiable programmatically. It's the Lisp of the database world.

edw519 · on March 26, 2008

"Most relational databases actually have an upper limit on the types of objects, typically referred to as tables, which can be handled. Too many tables in a database schema is considered bad design."

By whom? The number of tables in the schema should be what the app calls for, no more, no less. That's like saying, "Too many lines of code in a program is considered bad programming."

I don't know what implementations OP has seen, but GOOD RDBMS's are limited only by hardware.

Personally, I have occasionally encountered apps whose best approach was not RDBMS, but I have never run into a data problem that couldn't be handled by a RDBMS.

BrandonM · on March 26, 2008

If I was downvoted because I said "More tables... is a good thing," I want to clarify that I meant in terms of properly abstracting relationships away from the data, not in terms of performance. If that's not why I was downvoted, then I'm curious to know why.

pius · on March 26, 2008

I don't know why someone would downmod you for that comment, but I threw an upmod your way to try to even things out.

mattrepl · on March 27, 2008

I'm surprised no one brought up column-oriented databases.

When attributes can be added dynamically and/or have a small set of values, column-oriented databases with bitmap indices outperform traditional relational databases.

See MonetDB for an open source, usable column-oriented db with bitmap index support: http://monetdb.nl/

For more kvetching on the topic: http://dynamictyping.org/post/29661699#disqus_thread

I tried to find a public research document with performance comparisons, but to no avail.
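To make the bitmap-index idea above concrete, here is a toy sketch in plain Python. It is not how MonetDB or any particular engine implements it; the `BitmapIndex` class, the helper `rows_from_bitmap`, and the sample columns are hypothetical. Each column is stored on its own, and each distinct value gets one integer used as a bitset, so a filter over several low-cardinality attributes reduces to bitwise AND/OR.

```python
from collections import defaultdict


class BitmapIndex:
    """One bitmap per distinct value; bit i is set when row i holds that value."""

    def __init__(self, values):
        self.bitmaps = defaultdict(int)
        for row, value in enumerate(values):
            self.bitmaps[value] |= 1 << row

    def rows_equal(self, value):
        """Return the bitmap of rows whose column equals `value`."""
        return self.bitmaps.get(value, 0)


def rows_from_bitmap(bitmap):
    """Decode a bitmap into a sorted list of row numbers."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Two columns stored separately, each with only a few distinct values.
color = ["red", "blue", "red", "green", "blue", "red"]
size = ["S", "M", "M", "S", "M", "L"]

color_index = BitmapIndex(color)
size_index = BitmapIndex(size)

# WHERE color = 'red' AND size = 'M' becomes a single bitwise AND.
both = color_index.rows_equal("red") & size_index.rows_equal("M")
print(rows_from_bitmap(both))  # [2]
```

Adding a new attribute in this layout means adding one more column and its index, without rewriting the existing ones, which is one reason the approach suits dynamically added attributes.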

workpost · on March 26, 2008

Anybody ever heard of Netezza? I was told about it by someone who works with vast amounts of data - many terabytes every day. It's part of a hardware system that uses overwhelming processor power to segment and blast through data. You don't need the traditional headers or relational database structures to search like this. The technology is costly now (it's specialized hardware) but if you want to talk about where databases are going beyond the relational, this is one direction.

joshwa · on March 27, 2008

http://en.wikipedia.org/wiki/Entity-Attribute-Value_model

andrewparker · on March 26, 2008

If we give up relational databases in favor of a graph model, I'm sure we'll have piles of blog posts complaining about the sacrifices made in that switch. That said, the relational database is ancient technology that was built for an entirely different purpose than for what it's used today, so I eagerly anticipate a revision.

sah · on March 26, 2008

This paper discusses some related reasons why traditional relational databases are poorly suited to some modern applications: http://www.vldb.org/conf/2007/papers/industrial/p1150-stoneb...

pistoriusp · on March 26, 2008

Whydoeseverythingsuck.com? Probably because you're asking the wrong questions...

hank777 · on March 26, 2008

I am the author of this article, and I must say, though I am not a regular reader of HN, I do find it refreshing that there are smart people here arguing real merits in an intelligent, respectful way. I am sure it is not always the case, but it is really fun to read everyone's perspective.

I don't think graph databases are for everything, but I do think that they will end up providing a much better abstraction for the kinds of apps we tend to write on the web. I do think an RDBMS is better for an accounting system for example. Oh, and my examples were not designed to actually be great real world examples, but I have a lot of less technical people reading my blog and so my goal was to provide examples that could be expressed succinctly. That said, there is no static example that cannot be expressed in a relational database. The problem is that relational databases (at least the ones that are available to us) are not at all fluid and flexible.

edw519 · on March 27, 2008

Hi Hank. I find it refreshing that authors of popular articles here have thick enough skin to be willing to engage the crowd. Welcome!

You said, "The problem is that relational databases (at least the ones that are available to us) are not at all fluid and flexible." Since I disagree with this and find you so eloquent in your arguments, I wonder if this debate is over "semantics". I don't know what you think is "available to us", but I have worked for ages with RDBMS that are so stunningly fluid and flexible, they have never failed to deliver what I needed for any app. Perhaps you haven't had the same opportunity (and joy). Remember, just because it's relational doesn't mean it has to be Microsoft, Oracle, or open source.

http://www-306.ibm.com/software/data/u2/

http://www.jbase.com/

http://www.rainingdata.com/products/dbms/index.html

http://www.revelation.com/

gregjor · on March 30, 2008

All of the products you list are variations of the original multivalued database designed by Dick Pick. These are often referred to as Pick- or Pick-style databases. Back in the 1970s and early 1980s there were several large vendors selling computers and operating systems based on the Pick database complete with a dialect of BASIC with the database functionality embedded. All of those companies went out of business a few years after the first commercial relational databases arrived on the market: Prime Computer, Microdata, Ultimate (owned by Honeywell), and a few others. IBM bought the most popular Pick clone, Universe, and renamed it U2. Raining Data is the result of a merger between what was left of Dick Pick's company and another non-relational desktop database, Omnis. jBase is yet another Pick clone.

I worked as a Pick consultant in the 1980s, mainly on Prime/Information systems. Now when I run into Pick databases still in use they are always undergoing or slated for replacement. The replacement is always a modern commercial RDBMS such as Oracle or MS SQL Server.

Technically Pick-style multivalued databases are not relational, but they can be made to act more or less like a relational database. The most important difference is the support for multivalued fields. Originally multivalued fields were the big selling point, and the Pick database engine is built around nested multivalued lists (a very different internal organization compared to, say, Oracle). Multivalues of course violate First Normal Form; a relation that includes a multivalued attribute cannot be said to be normalizable.
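A short illustration of the First Normal Form point, in plain Python. It says nothing about how a Pick engine stores its nested lists; the sample records and the helper `to_first_normal_form` are hypothetical. A record with a multivalued field is split into a parent row plus one child row per value, which is the shape a relational schema would use instead.

```python
# Records as a multivalued store might hold them: "phones" is a list inside one field.
records = [
    {"id": 1, "name": "Alice", "phones": ["555-0100", "555-0101"]},
    {"id": 2, "name": "Bob", "phones": ["555-0200"]},
    {"id": 3, "name": "Carol", "phones": []},
]


def to_first_normal_form(rows, key, multivalued):
    """Split each record into an atomic parent row and (key, value) child rows."""
    parents, children = [], []
    for row in rows:
        parents.append({k: v for k, v in row.items() if k != multivalued})
        for value in row[multivalued]:
            children.append((row[key], value))
    return parents, children


contacts, contact_phones = to_first_normal_form(records, "id", "phones")

print(contacts[0])     # {'id': 1, 'name': 'Alice'}
print(contact_phones)  # [(1, '555-0100'), (1, '555-0101'), (2, '555-0200')]
```

In the split form every field holds a single atomic value, and a contact with no phone numbers simply has no child rows.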

Pick-style multivalued databases violate the relational model (the model based on sets and predicate logic) in other ways that I won't get into; Chris Date and Fabian Pascal have written at some length on the subject.

That isn't to say this database model is useless or wrong, just that it isn't relational in a strict sense. I've written and worked on large applications built on the Pick database and, while I would not choose those tools today, they are certainly powerful and flexible enough to build real applications on. The flexibility Pick adherents enjoy has a dark side in that data integrity must, for the most part, be enforced in application code; Pick-style databases are especially prone to dangling keys and type mismatches. Pick-style database programming requires the application programmer to lock records (rows) explicitly; support for ACID-compliant transactions is either non-existent or an afterthought bolt-on. Those can be pretty big problems for modern application architectures where the client and server are not running on the same minicomputer (a la Microdata and Prime).

Building a single application with a multivalued database can be "fluid and flexible" and maybe even faster than starting with a true relational database, but when multiple applications have to share a multivalued database and the integrity rules are therefore scattered across application domains and code, things can get messy pretty fast. Anyone who has spent time migrating Pick applications to any other platforms knows how easily the mix of business logic and database manipulation in the same piece of code makes for a big bowl of spaghetti.

mixmax · on March 26, 2008

"my examples were not designed to actually be great real world examples, but I have a lot of less technical people reading my blog"

This is definitely the place to give a real world example if you have one handy. Most people here would understand it, we would love the discussion, and you might even get some good feedback.

And thanks for posting here :-)

hank777 · on March 26, 2008

Well, as I said, there is nothing that can't be expressed in an RDBMS - at least at first. But let me give an example of the kind of use cases we see.

First, imagine having a database that allows one to freely create record types. One might have standard data types like contacts, events, emails, checks, expense reports, etc.

These record types are nodes. Now imagine being able to connect these nodes using any type of edge you like. For example, a contact might be connected to an event as an "invitee". That's how the edge would be labeled. Now the relational folks will say that that is a relationship that could be predicted. But at some point, some new type of record is created. And you as a user want to connect that record to existing records. For example, you have added a "shoe" record type to keep track of all of your shoes. You then decide you want shoes to be connected to events so that you can map what shoes you wore to what events. You don't want to modify your schema. You don't want to add a new mapping table; you just want to connect the record. And you want to be able to query the graph for all the things of any type that are connected to that record. More importantly, you want the end user to be able to decide that it would be useful to connect shoes to events, since no self-respecting programmer is ever going to design such a system.

This is the type of flexibility that you need in a web application that will evolve over time. But the minute you want to connect that new record type to the existing object, you either have to modify your schema, or you have created a database that is highly flexible via totally generalized mapping tables, but is not optimized for these kinds of structures. For example, just creating a giant mapping table to connect objects will work in an RDBMS, but it is not at all optimized and will fall over at scale. Since we are building something that will handle awesome scale, using an RDBMS in this way was a non-starter. Philosophically, we probably have more in common with Google BigTable than with an RDBMS.
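
For readers who want to see what the "giant mapping table" approach looks like, here is a minimal sketch. It assumes SQLite through Python's standard sqlite3 module, and the table names (node, edge), column names, and helper functions are made up for illustration; they are not taken from any system described in this thread. It shows only that a new record type such as "shoe" can be connected to events without a schema change, and says nothing about how this behaves at scale.

```python
import sqlite3

def build_graph_db():
    """Create an in-memory SQLite database with generic node and edge tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE node (
            id   INTEGER PRIMARY KEY,
            type TEXT NOT NULL,      -- e.g. 'contact', 'event', 'shoe'
            name TEXT NOT NULL
        );
        CREATE TABLE edge (
            src   INTEGER NOT NULL REFERENCES node(id),
            dst   INTEGER NOT NULL REFERENCES node(id),
            label TEXT NOT NULL      -- e.g. 'invitee'
        );
        CREATE INDEX edge_src ON edge(src);
        CREATE INDEX edge_dst ON edge(dst);
    """)
    return db

def add_node(db, type_, name):
    """Insert a record of any type and return its id."""
    cur = db.execute("INSERT INTO node (type, name) VALUES (?, ?)", (type_, name))
    return cur.lastrowid

def connect(db, src, dst, label):
    """Connect two records with a labeled edge."""
    db.execute("INSERT INTO edge (src, dst, label) VALUES (?, ?, ?)", (src, dst, label))

def neighbors(db, node_id):
    """Return (type, name, label) for every record connected to node_id, in either direction."""
    rows = db.execute("""
        SELECT n.type, n.name, e.label
          FROM edge e JOIN node n ON n.id = e.dst WHERE e.src = ?
        UNION ALL
        SELECT n.type, n.name, e.label
          FROM edge e JOIN node n ON n.id = e.src WHERE e.dst = ?
    """, (node_id, node_id)).fetchall()
    return sorted(rows)

db = build_graph_db()
alice = add_node(db, "contact", "Alice")
party = add_node(db, "event", "Launch party")
boots = add_node(db, "shoe", "Red boots")   # a new record type, no schema change needed
connect(db, alice, party, "invitee")
connect(db, boots, party, "worn_at")
print(neighbors(db, party))
```

The flexibility comes from keeping the record type and the edge label as plain data, which is also why the database itself cannot check that a given kind of edge makes sense between two given kinds of record.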

staticshock · on March 27, 2008

"you either have to modify your schema, or you have created a database that is highly flexible via totally generalized mapping tables, but is not optimized for these kinds of structures"

A generalizable mapping schema with tables for edges may not be optimal, but your comparison seems to be a bit of a bait-and-switch. Why compare the optimality of such a schema to a rigid schema instead of comparing it to the optimality of an alternative "graph-based" data store?

Granted, an extensible schema will be slow to query/etc. What makes you say that you can achieve better efficiency using a non-RDBMS approach? (Not that you can't, but I didn't see your argument to that effect. I'd say that without such an argument, the optimality/speed point is insignificant.)

wheels · on March 27, 2008

You can. Definitely. In fact, I've implemented this a few times (most recently last week); some for specific problems, some for more generic graph support.

In a nutshell:

The problem with RDBMS approaches is that the good ones assume you can pack your complex logic into a monster query or stored procedure and let the query optimizer do its thing. But if you're implementing an attribute-value system or graph traversal on top of an SQL database, you end up generating a ginormous number of queries just to do some basic traversal. You could potentially wrap those into a stored procedure that was doing selects into a temporary table, but that's not really the sort of thing that most query optimizers go to town on.
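
To make the "ginormous number of queries" point concrete, here is a rough sketch of a naive traversal over an SQL edge table. It assumes SQLite via Python's sqlite3 module and a hypothetical two-column edge table; the function names are invented for the example. The traversal asks the database for one node's outgoing edges at a time, so the number of queries grows with the number of nodes visited.

```python
import sqlite3
from collections import deque

def make_chain_db(n):
    """Build a hypothetical edge table holding a simple chain 0 -> 1 -> ... -> n-1."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
    db.execute("CREATE INDEX edge_src ON edge(src)")
    db.executemany("INSERT INTO edge VALUES (?, ?)",
                   [(i, i + 1) for i in range(n - 1)])
    return db

def reachable(db, start):
    """Breadth-first traversal that issues one SELECT per visited node."""
    seen = {start}
    queue = deque([start])
    queries = 0
    while queue:
        node = queue.popleft()
        rows = db.execute("SELECT dst FROM edge WHERE src = ?", (node,)).fetchall()
        queries += 1
        for (dst,) in rows:
            if dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen, queries

db = make_chain_db(1000)
seen, queries = reachable(db, 0)
print(len(seen), queries)   # one query for every node visited
```

With an in-process SQLite database each query is cheap; against a networked database server every one of those queries would also pay a round trip.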

On the other hand, there are a number of systems out there that attempt to be full object oriented databases, object relational mappings, or RDF based stores, but the current off-the-shelf ones tend to perform poorly since they're not very mature (and I get the feeling they are more focused on just being able to conveniently store stuff, not actually hitting it very hard).

When I first started looking at the sort of problems that Hank's addressing (in a series of talks I did in 2004 titled "Beyond Hierarchical Interfaces") I naïvely thought that you could do everything with an SQL backend, tried and failed. I could blab on about the sort of indexing that you need for these sorts of storage, but I'll duck out for now.

Edit: Just one example of where I've done this, if anyone cares, was replacing the old SQL backend with a dynamic (schema-less) attribute-value system and basic query language, for my current job: http://grunge-nouveau.net/Kore.mp4

staticshock · on March 27, 2008

Now, I may be pretty naive here, but if you're doing full-on graph traversal, why not just extract the full graph from the database and traverse it in memory on your own terms instead of leaving it to large unoptimized traversal queries?
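
A minimal sketch of what that could look like, assuming the edges live in a hypothetical two-column edge table in SQLite and that the whole graph fits comfortably in memory (which is the open question): read the table once, build an adjacency list, and traverse it in plain Python.

```python
import sqlite3
from collections import defaultdict, deque

def load_adjacency(db):
    """Read the whole edge table in one query and build an in-memory adjacency list."""
    adjacency = defaultdict(list)
    for src, dst in db.execute("SELECT src, dst FROM edge"):
        adjacency[src].append(dst)
    return adjacency

def reachable_in_memory(adjacency, start):
    """Breadth-first traversal over the in-memory adjacency list, no further queries."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dst in adjacency.get(node, ()):
            if dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen

# Small made-up graph: 1 -> 2 -> 3, 1 -> 4, and a separate edge 5 -> 6.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE edge (src INTEGER, dst INTEGER)")
db.executemany("INSERT INTO edge VALUES (?, ?)", [(1, 2), (2, 3), (1, 4), (5, 6)])

adjacency = load_adjacency(db)
print(sorted(reachable_in_memory(adjacency, 1)))
```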

wheels · on March 27, 2008

For the latest data set that I'm working on there are 5 million nodes and 50 million edges, and each one has some meta-data associated with it. :-)

wanorris · on March 27, 2008

Good points. I've run into this problem a lot, and generally handle it in one of 2 ways:

1. Make it easy for semi-technical project managers who are not coders to extend the schema. This is the 95% solution for us, and the core of our system.

2. Use n-to-n lookup tables or lookup fields that use a second field to determine what you reference. We don't do this a lot, but we do it in a few places where there can be a more or less unbounded set of things that can be referenced. These indeed have problems, so we try to avoid them, especially in high-volume situations.
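
A rough sketch of the second option, a lookup field whose target table is named by a second field. This assumes SQLite via Python's sqlite3 module; the tables (customer, invoice, note), the ref_type/ref_id columns, and the resolve helper are all hypothetical names chosen for the example, not anything from our framework.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoice  (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE note (
        id       INTEGER PRIMARY KEY,
        body     TEXT,
        ref_type TEXT,      -- names the table that ref_id points into
        ref_id   INTEGER    -- no foreign key is possible: the target table varies per row
    );
    INSERT INTO customer VALUES (1, 'Acme');
    INSERT INTO invoice  VALUES (1, 99.5);
    INSERT INTO note VALUES (1, 'call back', 'customer', 1);
    INSERT INTO note VALUES (2, 'disputed',  'invoice',  1);
    INSERT INTO note VALUES (3, 'orphan',    'invoice',  42);  -- dangling reference
""")

# Table names cannot be bound as SQL parameters, so only known names are allowed.
ALLOWED_TABLES = {"customer", "invoice"}

def resolve(db, note_id):
    """Follow a note's (ref_type, ref_id) pair to the row it references, or None."""
    row = db.execute("SELECT ref_type, ref_id FROM note WHERE id = ?",
                     (note_id,)).fetchone()
    if row is None:
        return None
    ref_type, ref_id = row
    if ref_type not in ALLOWED_TABLES:
        raise ValueError(f"unknown reference type: {ref_type!r}")
    return db.execute(f"SELECT * FROM {ref_type} WHERE id = ?", (ref_id,)).fetchone()

print(resolve(db, 1))
print(resolve(db, 2))
print(resolve(db, 3))   # the database accepted a reference to a missing invoice
```

The third note illustrates one of the problems: because the target table varies per row, the database cannot enforce the reference, so integrity checks end up in application code.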

Then again, note that this solution (a) requires using our framework to be effective and (b) has RDBMS purists seeing red. So maybe you're right.

Edit: section redacted.

bayareaguy · on March 27, 2008

I've seen plenty of graph problems solved easily with relational databases. However, a lot of people seem to confuse the issues here (e.g. when they see "relational" they assume a SQL RDBMS).

Can you provide a better explicit reference for what you're calling a "graph database"?

Tichy · on March 27, 2008

I don't think "semantic web" makes for a database. What a database provides is fast access to the data, with indexes and stuff. You will still need to process the semantic web to make it accessible in a fast way.

wheels · on March 27, 2008

Hank, could you drop me a mail? (One click away from my profile...) I might have some interesting stuff to fling in your general direction in the near future.