Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

The main table is stored in a 'heap', which is more similar to the dynamic allocation region of a program rather than the data structure.

If it needs to write to the table, it has a free space map listing all of the available fixed size pages. Each page can hold some number of fixed size tuples.

I'm surprised they've gotten so far with this. It seems ripe for fragmentation issues. I'm surprised they don't at least have a b+tree implementation to provide clustering based on the primary key. They have a b+tree implementation for indices, but not for tuple storage. IIRC, there's a feature to allow fixed size data row to be stored directly in the index, to help alleviate the need to look in the heap table. Perhaps that's what's kept the heap as viable?

Variable length strings are kept in a separate file (TOAST tables), referenced by the tuple. This keeps records fixed size, and allows for circumventing disk reads for string data for queries that don't require them. I'm not quite sure how the allocation scheme or data structure works there. My (very speculative) guess is that they're stored as a linked list, and on write, the allocator tries to give contiguous pages.

fwiw, I don't use postgresql on a regular basis, I've just spent some good amount of time squinting at the file format and related code, soley due to my own curiosity. It's been a while, so it's very possible that I'm misrepresenting or misremembering something. Someone has already posted their documentation on the file format -- it's a good read if you're interested.



>Variable length strings are kept in a separate file (TOAST tables), referenced by the tuple. This keeps records fixed size...

That's not entirely true: variable-width row values under a size threshold are stored inline (possibly compressed).

The beauty part about TOAST is that they're mostly just regular tables (any table that needs its data TOASTed gets its own TOAST table). You can look them up in the system catalogs and query them directly.

The schema for these is (chunk_id, chunk_seq, chunk_data). Chunk "pieces" are limited in size so they can be stored inline (obviously there's not much benefit to a TOAST table itself having a TOAST table), so any row values that need to be TOASTed may also be broken up into multiple pieces. Row values stored inline in the "owning" table reference a chunk_id where the TOASTed data is stored, and queries that need to return that data look up all the chunks for a chunk id, ordered by chunk_seq.

You're right that this scheme avoids reading TOASTed data when you're not actually using it (a good reason to avoid "SELECT *" if you've got big TOASTed data).


> but not for tuple storage

Both tables and indexes use the same b+-tree implementation based on Lehman and Yao (prev/next links on internal nodes).

> Each page can hold some number of fixed size tuples

Variable size.

> Variable length strings are kept in a separate file

Only if the row exceeds the page size, otherwise everything is stuffed into the data leaf node.

> I'm not quite sure how the allocation scheme or data structure works there

Values are broken up into DataSize/PageSize "virtual rows" and stitched back together, but everything is still in the same slotted page structure.


>Both tables and indexes use the same b+-tree implementation based on Lehman and Yao (prev/next links on internal nodes).

Tables in Postgres are not btree's they are unordered heaps. The row id in the b-tree index points to a page number + item index in the heap.

A true clustered index would simply expose the index itself as a table without secondary heap storage, reducing I/O and storage space. Row visibility seems to complicate this in PG as index only scans may require checking the heap for visibility.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: