Following a Select Statement Through Postgres Internals (2014) (patshaughnessy.net)
302 points by rspivak on April 10, 2016 | 27 comments



This article was also discussed before: https://news.ycombinator.com/item?id=8449329

Pat Shaughnessy has 3 other similar articles about Postgres internals. He does a great job of combining high level explanations, visual aids, and gritty details.

Discovering the Computer Science Behind Postgres Indexes http://patshaughnessy.net/2014/11/11/discovering-the-compute...

A Look at How Postgres Executes a Tiny Join http://patshaughnessy.net/2015/11/24/a-look-at-how-postgres-...

Is Your Postgres Query Starved for Memory? http://patshaughnessy.net/2016/1/22/is-your-postgres-query-s...


Thanks everyone for the kind comments, and for taking the time to read this stuff :)


I'm not going to read all of that right now. But you've earned yourself a subscriber.


"How Postgres Executes a Tiny Join" covers a case you usually try to avoid - a join with no indices. For that it has to create temporary hash tables. If indices are available, the usual approach is to walk one index while searching the other. Or both tables may be sorted and merged. Or an unindexed table sorted while an indexed table is walked.

The whole process is quite clever, and this is what SQL engines are really about.
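A quick way to watch the planner choose among those strategies is EXPLAIN (the tables and columns below are hypothetical, just a sketch):

```sql
-- EXPLAIN prints the chosen plan (Hash Join, Merge Join, or
-- Nested Loop) without actually running the query.
EXPLAIN
SELECT *
FROM orders o
JOIN customers c ON c.id = o.customer_id;

-- With no useful index on orders.customer_id the planner will
-- typically build a hash table (Hash Join); after
--   CREATE INDEX ON orders (customer_id);
-- it may switch to a Nested Loop that probes the index instead.
```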


"What’s going on here? Why is Postgres wasting its time, continuing to search even though it’s already found the data we’re looking for?"

Because Postgres has not already found what we are looking for. The query is

  select *
  from users
  where name = 'Captain Nemo'
  order by id asc
  limit 1;
Only after Postgres has found all users named 'Captain Nemo' can it sort them by their 'id' attribute and limit the result set to the first row of the sorted relation.

Otherwise very nice post though.
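For what it's worth, a composite index would let Postgres stop at the first match even with the ORDER BY in place (index name is my own; this assumes the users table from the article):

```sql
-- A composite index stores entries sorted by (name, id), so an
-- index scan returns matching rows already in id order and
-- LIMIT 1 can stop after the first one.
CREATE INDEX users_name_id_idx ON users (name, id);

SELECT *
FROM users
WHERE name = 'Captain Nemo'
ORDER BY id ASC
LIMIT 1;
```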


Yes, the OP here [0] says that the query was generated by Ruby's ActiveRecord. Clearly, removing the sort clause would speed up this query. I believe the issue (from the Ruby perspective) is using ActiveRecord's `User.where(name: 'Captain Nemo').first` instead of `User.where(name: 'Captain Nemo').take` [1]

[0] https://news.ycombinator.com/item?id=8449329 [1] http://api.rubyonrails.org/classes/ActiveRecord/FinderMethod...
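For reference, the SQL the two calls generate differs only in the ORDER BY (roughly; exact quoting and binding vary by Rails version):

```sql
-- User.where(name: 'Captain Nemo').first
SELECT * FROM users WHERE name = 'Captain Nemo' ORDER BY users.id ASC LIMIT 1;

-- User.where(name: 'Captain Nemo').take
SELECT * FROM users WHERE name = 'Captain Nemo' LIMIT 1;
```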


The author doesn't show the schema for the table. If the "name" field were UNIQUE, the search could stop at the first hit. Most SQL optimizers are smart enough to do this. UNIQUE is usually used with an index (otherwise, inserting a new record requires a full table scan for the uniqueness check) but that's not required.
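A sketch of that (constraint name hypothetical):

```sql
-- With this constraint at most one row can have a given name, so a
-- scan can stop at the first hit. Note that Postgres in particular
-- always backs a UNIQUE constraint with a unique index automatically.
ALTER TABLE users ADD CONSTRAINT users_name_key UNIQUE (name);
```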


One thing I've gotten pretty good at is taking some question I have about some software and answering it by looking through the source. I've wondered whether people would find it interesting if I showed the process of how I do this.

I've wondered whether YouTube might be a good medium for this. Blog articles feel kind of "produced" -- I thought it might be cool to capture the exploratory nature of how this works.


Please use a blog for this. I can read a blog anywhere. I don't need headphones. I can pause just by looking away. I can read text at my own pace and look things up. I can even print it out and add notes.

You can add the investigative process you used to get there, and note what didn't work and why.


I'm the same way: take a question and answer it by reading the source. But as my time gets more and more constrained, I wish more large open-source projects had design docs akin to CockroachDB's. I read through the CDB design doc in one sitting, then read the source, and it was such a treat to have that high-level context while wading through the details.

I can't say the same thing for ElasticSearch (or most large open-source projects). It's very difficult to get the high-level context without a well-written design doc explaining why things are the way they are. Some parts of ElasticSearch make sense, and other parts just seem hacked together. Of course, I could pull together little bits and pieces of high-level insight from multiple disparate blog posts and forums and mailing lists here and there.

But what I would like is a design doc! If you were to produce something along the lines of that for open-source projects that need it, I wouldn't hesitate to pay a modest fee.


  > If you were to produce something along the lines of that 
  > for open-source projects that need it, I wouldn't hesitate 
  > to pay a modest fee.
Try BountySource [1] maybe?

There you can post a proposal for some improvement to an OSS project (a design doc, in your case) and pledge a bounty for it.

If other people have the same problem, they might chip in further, making it more visible/profitable to solve.

----

[1] https://www.bountysource.com/


Thanks for the link! First time I've seen that marketplace. I'll look into it.


Postgres actually has a lot of that kind of documentation in the form of README files in source directories.


Absolutely! Postgres has amazing documentation.

My comment was directed towards other projects. I wish they would follow Postgres's lead.


I'd be really interested to see something like this. YouTube probably is the best medium for showing content like this. I can think of a few examples of "expert" coders showing how they work through typical development scenarios, but can't remember coming across anything like this for postgres.


As a mathematician trying to enter the world of software engineering I think this would be a useful resource!


I've been looking for books or articles on Database Architecture and internals for a while now.

I found this paper "Architecture of a Database System" [1] to be an excellent starting point; it's written by database veterans James Hamilton and Michael Stonebraker and references other much deeper works.

I've always liked Pat Shaughnessy's work, Ruby under a Microscope was a fantastic introduction to MRI 2.0. I really hope to see more works like these.

[1] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf


That paper is excellent and serves as a great intro for a graduate database course. It would be remiss not to list the main author though (Joe Hellerstein, Berkeley).


I assume you know about the red book? I haven't read it, but it always comes up in these discussions. http://www.redbook.io/


> Next Postgres analyzes and rewrites our query, optimizing and simplifying it using a series of complex algorithms.

Analyze & rewrite doesn't really use any complex algorithms.

Analyzing basically means that we resolve tables/objects/operators/databases/... in the parse-tree into what they mean in the current database with the current settings. The process of parsing itself (via a bison parser) doesn't do any object lookups and such. So errors about non-existent tables, operators and such will mostly happen during the analyze phase.

The rewrite phase, which often will do nothing, will resolve both explicitly (CREATE VIEW) and implicitly (row-level security) referenced views by their definition. It also processes rules in an equivalent manner, but you should never use those...

The actual optimization you're referring to will happen as part of planning.

EDIT: typos
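As an illustration of the rewrite phase expanding a view (schema hypothetical):

```sql
CREATE VIEW active_users AS
  SELECT * FROM users WHERE active;

-- The rewriter replaces the view reference with its definition,
-- so this is processed as a query against users directly:
SELECT name FROM active_users;
```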


Rewrite phase is also supposed to "flatten subqueries/views" (cf. "Extensible/rule based query rewrite optimization in Starburst"; SIGMOD 92) and can be fairly complex. PostgreSQL doesn't do it to the same extent as DB2, but there is certainly some complex logic involved there.


That paper's definition of rewriting isn't the same as postgres' (they even say that: "The work presented here should not be confused with the query rewrite facility of POSTGRES [SJGP90]."). Flattening subqueries and views happens during planning in postgres.

Source: I'm a PostgreSQL Developer.
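A minimal example of the kind of subquery pull-up being discussed (hypothetical table):

```sql
-- The planner can flatten ("pull up") this subquery, planning it
-- as if it were written as a single-level query over users:
SELECT u.name
FROM (SELECT name FROM users WHERE id < 10) AS u;
```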


Rewrite transforms the SQL to use a pre-aggregated materialized view if one exists.


PostgreSQL, unfortunately, can't do that. But if it were to happen, it wouldn't be in the rewriter. It wouldn't make sense to attempt such transformations before some other optimizations (like flattening of subqueries et al.), and those happen in the optimization step.


Interesting stuff. I've recently gotten interested in databases and done some toy programming related to them, trying to understand how they work.

I just wish the author had gone closer to the nitty gritty details of _how_ the data is actually fetched from the disk. How does Postgres store the data? What do the data files look like, how is the parsing done?

In any case I appreciate the effort. I guess I might have to dive into the source code myself.


There's some documentation about this in the official docs, cf. http://www.postgresql.org/docs/devel/interactive/storage.htm...


Depending on the indexes and other magic, more efficient might be:

select * from users where name = 'Captain Nemo' and id = (select min(id) from users where name='Captain Nemo')

Trying to forget bad memories of forcing the Oracle query planner into submission with more hints than actual SQL.



