Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Tree structure query with PostgreSQL (truongtx.me)
149 points by dpeck on Feb 26, 2015 | hide | past | favorite | 31 comments


I've been playing with some more advanced feature in postgres (lately, mostly json and jsonb related), and one of the main aspect I'm concern about is the potential pitfalls -- specifically performance pitfalls. Mostly because I feel like there are too many magics around, and I have no idea how much overhead for any action is.

The advanced features allow you to have more than one way to do things, sometimes in a slightly more concise way (I've been tempting to use jsonb type in place of simple many-to-many tables if one of the values are just always be string). It's not that I'm expecting my app to be web scale or anything, but I'd rather not having to redesign everything one a query choke on a million records. And most of the docs tend to be light on performance characteristics.

As performance is one of the most important aspect in database system in general (which I'd say on par with consistency, and way a head of features), it'd be nice to have docs focus more on not just how to use features, but when it shouldn't be used.


Your concern is valid. Whenever complexity is abstracted away you may suddenly introduce an exponential order algorithm in an otherwise simple looking query.

Having said so, Postgresql has actually surprised me very little throughout time. The planner's "reasoning" is quite comprehensible. Inevitably I trigger a slow plan here and there but, once identified, correction is usually straightforward. My overall opinion is that postgresql has quite solid theoretical foundations and it shows.

What I mean is that, in advanced stuff, there'll always be some "magic", and that postgresql has the least amount of black box magic I've ever seen on a RDBMS (I'm looking at you, Oracle).

(Postgresql has been my non-clustered database of choice since about 2004. I'm definitely biased.)


What do you mean by "magic" and "no idea how much overhead for any action is". EXPLAIN is your friend.


I think this is a valid point. For non-performance-critical code where readability is more important (eg analytics) the additional language features could be nice.

In my experience, 9/10 production use cases are covered using a path as primary key with prefix matching (select path,data from tree where path='a/b' or path like 'a/b/%') - as a side effect your subtree nodes will often be on the same table page (mySQL by default, pgSQL when using CLUSTER).


If you ever need web scale, there is always nosql to fall back on Kappa


Interesting article! By the way, SQLite also has support for recursive queries (and allows even slightly more advanced stuff than Postgres, e.g. ordering of results), which you can use to do many amazing things.

I recently wrote a blog article about this, where I use recursive common table expressions solve the "eight queens" problem in SQL (using both Postgres and SQLite):

http://www.andreas-dewes.de/en/2015/queens-of-the-data-age-a...


I am not sure of performance. What I do know is that Vadim Tropashko's SQL Design Patterns has a few ideas on how to encode a materialized path see http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.102... . Also, bojanz and me have ideas here, see http://bojanz.wordpress.com/2014/04/25/storing-hierarchical-... and I know that these are as fast as they can be.


This chapter seems to give more detail on "Nested Intervals Tree Encoding": https://vadimtropashko.files.wordpress.com/2011/07/ch5.pdf

Via a comment from what seems to be from Vadim Tropashko, here: https://communities.bmc.com/docs/DOC-9902

The book itself seems to be found here (where "trees in sql" is found in chapter 5): http://rampant-books.com/book_0601_sql_coding_styles.htm

I have no connection to Vadim Tropashko, just interested in knowing more about "Nested Intervals Tree Encoding".


That ch5 pdf is from the book I mentioned, I didn't know Mr. Tropashko made it available for free, great! (We have conversed briefly in 2006 when I was designing the menu tree system for Drupal 6 and he sent me the draft of that chapter.)


That's the method that Disqus used as well.[1] Later on they used a materialized path solution [2], but I remained impressed that the recursive query worked so well at their scale, enough so that I used the same method in a personal project that had a hierarchical comment system. Happily, it wasn't much of a hack to do this in Django... the raw query support is quite good; you don't have to sacrifice the ORM to do something like this.

[1] http://cramer.io/2010/05/30/scaling-threaded-comments-on-dja...

[2] http://cramer.io/2012/04/08/using-arrays-as-materialized-pat...


To toot my own horn, I wrote a blog post that takes their approach and extends it to sort sibling comments by upvotes, like on HN or Reddit:

http://illuminatedcomputing.com/posts/2014/09/postgres-cte-f...

It's trickier than you might expect!


Oh awesome, my solution to that is pretty similar. I also added a couple window functions in the final SELECT statement to get each comment's sibling count and position, which my application code uses to do pagination. It's definitely very puzzle-y stuff. Time consuming, but fun when you get a nice solution!


I have never had great performance with recursive queries in SQL Server, does postgre perform better in this respect?

I have seen a few posts on HN regarding recursive CTEs recently and I shy away from them because it seems like an additional invocation for each different recursive lookup and a lot of the queries can be rewritten and potentially optimized using more traditional approaches.


I've generally seen recursive queries in SQL Server perform as well as I expected or sometimes slightly better, so I'm happy to use them (rather than a stored procedure for example which have, on the occasions I've done proper tests, turned out to be less well performing if only by a little) where I need to read from structures that require it.

Where I have control of the design though, I still prefer to create structures that don't* require recursive queries*, even considering this means a little extra logic on insert/update/delete. Depending on circumstance (and for these examples assuming a tree or forest, not a more general graph) this structural difference is either storing the full ID path of each node or a matrix of "node Z is under node Y by one level, node Z is under node X by two levels, node Y is under node X by one level, ...), instead of (or as well as) the naive tree structure. Well written operations on these structures perform better than a recursive query, sometimes significantly so, but of course it is a trade-off as you have to accept the extra space (and potentially RAM) needed and extra processing on node changes, additions, deletes, & moves.



To summarize for anyone reading this thread, and thank you fleetfox for posting this:

"Yes, you heard that right, the SQL statement is 5x more performant at the database level alone."

I will have to do more testing and see what I can accomplish.


I've had positive experience with Postgresql, but I haven't personnaly witnessed how it behaves in production. In my opinion there aren't good alternatives to a recursive query:

- Send one request for each level of hierarchy? Worst of all.

- Stored procedure? I don't think it would be optimised compared to a recursive query.

See also: http://stackoverflow.com/questions/5861272/postgresql-with-r...



Ugh, CTE's to construct a tree/graph in SQL Server brings back terrible memories. I'd estimate 25% of our time at one point was spent building aggressive caching in front of it due to the awful performance as more nodes were created.

On a different product now (using PG) and guess what has come up? We need tree nodes! My first thought: anything but CTEs. ltree, materialized paths, neo4j, anything but CTEs. They've just burned me so badly before I'm too afraid to try them in PG.


I worked at a place that had recursive ACLs. We used CTEs and it was pretty darn quick. We only had on the order of a million records, though, so ymmv.


How quick is quick? (ms?) What kind of hardware were you running? (all data in ram?) Just wondering, still interested in the topic!


I haven't done nearly enough research to back this point of view, but it seems to me that if you really need to make a recursive query over a data tree, you should maybe reconsider how that tree is being modeled. The closure table model [1] is simple to query non-recursively, and unless your tree is pathological, it doesn't add that mcuh space overhead. I suppose though that a built-in recursive query might not be so bad though, if it's just a whole bunch of indexed row lookups within one query.

Perhaps there are situations where recursive queries really are the only solution. I'd love to hear some examples.

[1] http://technobytz.com/closure_table_store_hierarchical_data....


How do you determine if your tree is pathological? I.e., what types of tree structures would you find the closure table model to be pathological enough to warrant other modeling options?


Good question -- so the most pathological tree of n nodes would be one with depth of O(n). If this was the case, the closure table would be of size O(n^2). In tree applications I can think of--like an org chart or site comments--the depth is O(log(n)), in which case the closure table is just O(n). Note that this same kind of pathological tree would be hell for a recursive query too, just trading space for time.


For trees that have few inserts/deletes and a large number of reads (like most threaded comments), I'd probably go with a modified preorder tree traversal (MPTT) implementation.

With an MPTT structure you can query the entire tree (or subsets of said tree) with a single non-recursive SQL query, at the expense of having to use multiple statements to perform inserts and/or deletions. If you're really concerned about insert performance (which can be an issue for large trees), there are always hybrid approaches where you use adjacency lists for maintaining newly inserted nodes and then run an out-of-band operation to rewrite the relevant MPTT tree values.

*note: MPTTs used to be all the rage maybe 6-8 years ago, and it's like everyone forgot about them. What's up with that?


PostgreSQL has a built-in ltree contrib datatype. Wouldn't that work just as well?


ltree isn't going to be a one-size-fits-all if you need more than one or two simple sorting conditions. Also, since ltree is really a glorified char field, you end up sorting by character ordinals instead of full values.

Of course you can slice and dice the ltree string up, but at what point do you just use a CTE?


How fast is this comparing to nested sets? Some time ago I did my own nested sets implementation in Oracle and for 100k+ records the path from a leaf to root as well as all sub-tree nodes retrieval took seconds with WITH and milliseconds with nested sets (2-3 orders of magnitude faster).


Recursive CTEs are a cool feature, but it's a shame the syntax is so awful. You really have to work to understand what the query is trying to do.

Other query languages that make recursion much more bearable are SPARQL (on RDF data), Cypher (Neo4j) or Datalog (Datomic, Cascalog, etc). Would love to see more of those query languages being used.


I think it's a shame that many database courses don't even mention advanced features like CTEs or Window functions.

It makes me cringe anytime I see a statement with complicated nested self joins to solve a problem that should not exist because most current database systems implemented the relevant standards years ago.


Better off with Closure Trees (aka Closure Tables.)

Start at page 48.

http://www.slideshare.net/billkarwin/sql-antipatterns-strike...




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: