
I am the author of this article, and I must say, though I am not a regular reader of HN, I do find it refreshing that there are smart people here arguing real merits in an intelligent, respectful way. I am sure that is not always the case, but it is really fun to read everyone's perspective.

I don't think graph databases are for everything, but I do think that they will end up providing a much better abstraction for the kinds of apps we tend to write on the web. I do think an RDBMS is better for an accounting system, for example. Oh, and my examples were not designed to actually be great real world examples, but I have a lot of less technical people reading my blog and so my goal was to provide examples that could be expressed succinctly. That said, there is no static example that cannot be expressed in a relational database. The problem is that relational databases (at least the ones that are available to us) are not at all fluid and flexible.




Hi Hank. I find it refreshing that authors of popular articles here have thick enough skin to be willing to engage the crowd. Welcome!

You said, "The problem is that relational databases (at least the ones that are available to us) are not at all fluid and flexible." Since I disagree with this and find you so eloquent in your arguments, I wonder if this debate is over "semantics". I don't know what you think is "available to us", but I have worked for ages with RDBMS that are so stunningly fluid and flexible, they have never failed to deliver what I needed for any app. Perhaps you haven't had the same opportunity (and joy). Remember, just because it's relational doesn't mean it has to be Microsoft, Oracle, or open source.

http://www-306.ibm.com/software/data/u2/

http://www.jbase.com/

http://www.rainingdata.com/products/dbms/index.html

http://www.revelation.com/


All of the products you list are variations of the original multivalued database designed by Dick Pick. These are often referred to as Pick- or Pick-style databases. Back in the 1970s and early 1980s there were several large vendors selling computers and operating systems based on the Pick database complete with a dialect of BASIC with the database functionality embedded. All of those companies went out of business a few years after the first commercial relational databases arrived on the market: Prime Computer, Microdata, Ultimate (owned by Honeywell), and a few others. IBM bought the most popular Pick clone, Universe, and renamed it U2. Raining Data is the result of a merger between what was left of Dick Pick's company and another non-relational desktop database, Omnis. jBase is yet another Pick clone.

I worked as a Pick consultant in the 1980s, mainly on Prime/Information systems. Now when I run into Pick databases still in use they are always undergoing or slated for replacement. The replacement is always a modern commercial RDBMS such as Oracle or MS SQL Server.

Technically Pick-style multivalued databases are not relational, but they can be made to act more or less like a relational database. The most important difference is the support for multivalued fields. Originally multivalued fields were the big selling point, and the Pick database engine is built around nested multivalued lists (a very different internal organization compared to, say, Oracle). Multivalues of course violate First Normal Form; a relation that includes a multivalued attribute is not even in first normal form, so it can't be called normalized at all.

Pick-style multivalued databases violate the relational model (the model based on sets and predicate logic) in other ways that I won't get into; Chris Date and Fabian Pascal have written at some length on the subject.

That isn't to say this database model is useless or wrong, just that it isn't relational in a strict sense. I've written and worked on large applications built on the Pick database and, while I would not choose those tools today, they are certainly powerful and flexible enough to build real applications on. The flexibility Pick adherents enjoy has a dark side in that data integrity must, for the most part, be enforced in application code; Pick-style databases are especially prone to dangling keys and type mismatches. Pick-style database programming requires the application programmer to lock records (rows) explicitly; support for ACID-compliant transactions is non-existent or an afterthought bolt-on. Those can be pretty big problems for modern application architectures where the client and server are not running on the same minicomputer (a la Microdata and Prime).

Building a single application with a multivalued database can be "fluid and flexible" and maybe even faster than starting with a true relational database, but when multiple applications have to share a multivalued database, and the integrity rules are therefore scattered across application domains and code, things can get messy pretty fast. Anyone who has spent time migrating Pick applications to other platforms knows how the mix of business logic and database manipulation in the same piece of code makes for a big bowl of spaghetti.


"my examples were not designed to actually be great real world examples, but I have a lot of less technical people reading my blog"

This is definitely the place to give a real world example if you have one handy. Most people here would understand it, we would love the discussion, and you might even get some good feedback.

And thanks for posting here :-)


Well, as I said, there is nothing that can't be expressed in an RDBMS - at least at first. But let me give an example of the kind of use cases we see.

First, imagine having a database that allows one to freely create record types. One might have standard data types like contacts, events, emails, checks, expense reports, etc.

These record types are nodes. Now imagine being able to connect these nodes using any type of edge you like. For example, a contact might be connected to an event as an "invitee". That's how the edge would be labeled. Now the relational folks will say that that is a relationship that could have been predicted. But at some point, some new type of record is created. And you as a user want to connect that record to existing records. For example, you have added a "shoe" record type to keep track of all of your shoes. You then decide you want shoes to be connected to events so that you can map what shoes you wore to what events. You don't want to modify your schema. You don't want to add a new mapping table, you just want to connect the record. And you want to be able to query the graph for all the things of any type that are connected to that record. More importantly, you want the end user to be able to decide that it would be useful to connect shoes to events, since no self-respecting programmer is ever going to design such a system.

This is the type of flexibility that you need in a web application that will evolve over time. But the minute you want to connect that new record type to the existing objects, you either have to modify your schema, or you have to create a database that is highly flexible via totally generalized mapping tables but is not optimized for these kinds of structures. For example, just creating a giant mapping table to connect objects will work in an RDBMS, but it is not at all optimized and will fall over at scale. Since we are building something that will handle awesome scale, using an RDBMS in this way was a non-starter. Philosophically, we probably have more in common with Google BigTable than with an RDBMS.
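The labeled-edge model described here can be sketched in a few lines. This is a minimal illustration, not the author's actual implementation; the record types, IDs, and edge labels ("invitee", "worn-at") are invented to mirror the shoe/event example:

```python
from collections import defaultdict

class GraphStore:
    """Toy store: arbitrary typed records (nodes) plus labeled edges."""

    def __init__(self):
        self.nodes = {}                # node_id -> (record_type, attributes)
        self.edges = defaultdict(set)  # node_id -> {(label, other_node_id)}

    def add_node(self, node_id, record_type, **attrs):
        self.nodes[node_id] = (record_type, attrs)

    def connect(self, a, b, label):
        # Any node can be linked to any other with a user-chosen label;
        # no schema change is needed when a new record type appears.
        self.edges[a].add((label, b))
        self.edges[b].add((label, a))

    def connected(self, node_id, label=None):
        # Everything attached to a record, optionally filtered by label.
        return [other for (lbl, other) in self.edges[node_id]
                if label is None or lbl == label]

g = GraphStore()
g.add_node("e1", "event", name="Gala")
g.add_node("c1", "contact", name="Alice")
g.connect("c1", "e1", "invitee")
g.add_node("s1", "shoe", brand="Oxford")  # new record type, no migration
g.connect("s1", "e1", "worn-at")
print(sorted(g.connected("e1")))  # ['c1', 's1']
```

The point of the sketch is the last line: one query returns everything attached to the event, regardless of record type, and adding the "shoe" type required no schema change.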


you either have to modify your schema, or you have created a database that is highly flexible via totally generalized mapping tables, but is not optimized for these kinds of structures

A generalizable mapping schema with tables for edges may not be optimal, but your comparison seems to be a bit of a bait-and-switch. Why compare the optimality of such a schema to a rigid schema instead of comparing it to the optimality of an alternative "graph-based" data store?

Granted, an extensible schema will be slow to query/etc. What makes you say that you can achieve better efficiency using a non-RDBMS approach? (Not that you can't, but I didn't see your argument to that effect. I'd say that without such an argument, the optimality/speed point is insignificant.)


You can. Definitely. In fact, I've implemented this a few times (most recently last week); some for specific problems, some for more generic graph support.

In a nutshell:

The problem with RDBMS approaches is that the good ones assume you can pack your complex logic into a monster query or stored procedure and let the query optimizer do its thing. But if you're implementing an attribute-value system or graph traversal on top of an SQL database, you end up generating a ginormous number of queries just to do some basic traversal. You could potentially wrap those into a stored procedure that was doing selects into a temporary table, but that's not really the sort of thing that most query optimizers go to town on.

On the other hand, there are a number of systems out there that attempt to be full object-oriented databases, object-relational mappings, or RDF-based stores, but the current off-the-shelf ones tend to perform poorly since they're not very mature (and I get the feeling they are more focused on just being able to conveniently store stuff, not actually hitting it very hard).

When I first started looking at the sort of problems that Hank's addressing (in a series of talks I did in 2004 titled "Beyond Hierarchical Interfaces") I naïvely thought that you could do everything with an SQL backend, tried and failed. I could blab on about the sort of indexing that you need for these sorts of storage, but I'll duck out for now.

Edit: Just one example of where I've done this, if anyone cares, was replacing the old SQL backend with a dynamic (schema-less) attribute-value system and basic query language, for my current job: http://grunge-nouveau.net/Kore.mp4


Now, I may be pretty naive here, but if you're doing full on graph traversal, why not just extract the full graph from the database and traverse it in memory on your own terms instead of leaving it to large unoptimized traversal queries?
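For what it's worth, the approach being suggested is roughly this (a sketch under the assumption that the whole edge set fits in memory; `edge_rows` stands in for whatever a bulk query would return):

```python
from collections import defaultdict, deque

# Pull every edge once, build an adjacency list, traverse on our own terms.
edge_rows = [("a", "b"), ("b", "c"), ("b", "d"), ("d", "e")]

adj = defaultdict(list)
for src, dst in edge_rows:
    adj[src].append(dst)

def bfs(start):
    """Breadth-first visit order from `start`, entirely in memory."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

print(bfs("a"))  # ['a', 'b', 'c', 'd', 'e']
```

This trades one big extraction query for fast in-process traversal, at the cost of holding the graph (and keeping it fresh) in RAM.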


For the latest data set that I'm working on there are 5 million nodes and 50 million edges, and each one has some meta-data associated with it. :-)


Good points. I've run into this problem a lot, and generally handle it in one of 2 ways:

1. Make it easy for semi-technical project managers who are not coders to extend the schema. This is the 95% solution for us, and the core of our system.

2. use n-to-n lookup tables or lookup fields that use a second field to determine what you reference. We don't do this a lot, but we do it in a few places where there can be a more or less unbounded set of things that can be referenced. These indeed have problems, so we try to avoid them, especially in high-volume situations.
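Option 2 above can be sketched concretely. This is an invented schema for illustration (table and column names are mine, not the poster's): the lookup row stores both a target id and a type discriminator saying which table that id lives in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE events   (id INTEGER PRIMARY KEY, title TEXT);
-- ref_type names the table that ref_id points into.
CREATE TABLE links    (src_id INTEGER, ref_type TEXT, ref_id INTEGER);
""")
conn.execute("INSERT INTO contacts VALUES (1, 'Alice')")
conn.execute("INSERT INTO events VALUES (7, 'Launch party')")
conn.execute("INSERT INTO links VALUES (1, 'events', 7)")

def resolve(src_id):
    """Follow every link from src_id into whichever table it names."""
    out = []
    for ref_type, ref_id in conn.execute(
            "SELECT ref_type, ref_id FROM links WHERE src_id=?", (src_id,)):
        # The discriminator picks the table at runtime -- flexible, but
        # the database cannot enforce referential integrity on ref_id,
        # which is exactly the "problems" alluded to above.
        row = conn.execute(
            f"SELECT * FROM {ref_type} WHERE id=?", (ref_id,)).fetchone()
        out.append((ref_type, row))
    return out

print(resolve(1))  # [('events', (7, 'Launch party'))]
```

Because `ref_id` can't be a real foreign key, dangling references become the application's problem, which is why this pattern gets avoided in high-volume situations.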

Then again, note that this solution (a) requires using our framework to be effective and (b) has RDBMS purists seeing red. So maybe you're right.

Edit: section redacted.


I've seen plenty of graph problems solved easily with relational databases. However, a lot of people seem to confuse the issues here (e.g. when they see "relational" they assume a SQL RDBMS).

Can you provide a better explicit reference for what you're calling a "graph database"?


I don't think the "semantic web" makes for a database. What a database provides is fast access to the data, with indexes and stuff. You would still need to process the semantic web to make it accessible in a fast way.


Hank, could you drop me a mail? (One click away from my profile...) I might have some interesting stuff to fling in your general direction in the near future.



