The main table is stored in a 'heap', which is more similar to the dynamic alloc...

uhoh-itsmaciek · on May 20, 2022

>Variable length strings are kept in a separate file (TOAST tables), referenced by the tuple. This keeps records fixed size...

That's not entirely true: variable-width row values under a size threshold are stored inline (possibly compressed).

The beauty part about TOAST is that they're mostly just regular tables (any table that needs its data TOASTed gets its own TOAST table). You can look them up in the system catalogs and query them directly.

The schema for these is (chunk_id, chunk_seq, chunk_data). Chunk "pieces" are limited in size so they can be stored inline (obviously there's not much benefit to a TOAST table itself having a TOAST table), so any row values that need to be TOASTed may also be broken up into multiple pieces. Row values stored inline in the "owning" table reference a chunk_id where the TOASTed data is stored, and queries that need to return that data look up all the chunks for a chunk id, ordered by chunk_seq.

You're right that this scheme avoids reading TOASTed data when you're not actually using it (a good reason to avoid "SELECT *" if you've got big TOASTed data).

tomnipotent · on May 20, 2022

> but not for tuple storage

Both tables and indexes use the same b+-tree implementation based on Lehman and Yao (prev/next links on internal nodes).

> Each page can hold some number of fixed size tuples

Variable size.

> Variable length strings are kept in a separate file

Only if the row exceeds the page size, otherwise everything is stuffed into the data leaf node.

> I'm not quite sure how the allocation scheme or data structure works there

Values are broken up into DataSize/PageSize "virtual rows" and stitched back together, but everything is still in the same slotted page structure.

SigmundA · on May 20, 2022

>Both tables and indexes use the same b+-tree implementation based on Lehman and Yao (prev/next links on internal nodes).

Tables in Postgres are not btree's they are unordered heaps. The row id in the b-tree index points to a page number + item index in the heap.

A true clustered index would simply expose the index itself as a table without secondary heap storage, reducing I/O and storage space. Row visibility seems to complicate this in PG as index only scans may require checking the heap for visibility.