The surprising impact of medium-size texts on PostgreSQL performance (hakibenita.com)
182 points by haki on Oct 20, 2020 | 16 comments


I've mentioned this story here before, but one of the most surprising performance gains I saw came from eliminating redundant TOAST lookups. If I recall correctly, each time you use the `->>` operator on a TOASTed jsonb column, the column is deTOASTed. That means if you write a query like:

  SELECT x ->> 'field1', x ->> 'field2', x ->> 'field3'
  FROM my_table
and x is TOASTed, Postgres will deTOAST x three different times. This multiplies the amount of data that needs to be processed and dramatically slows things down.

My first attempt to fix this was to read the whole field in a subselect and then pull out the individual fields in the outer query. That attempt was thwarted by the Postgres optimizer, which inlined the subquery and still deTOASTed the field multiple times.

After a discussion on the Postgres IRC channel, RhodiumToad pointed out that adding OFFSET 0 to the end of the subquery prevents Postgres from inlining it. After retrying with that, I saw an order-of-magnitude improvement from eliminating the redundant work.
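
A sketch of what the fixed query looks like (same placeholder table and field names as above):

  SELECT t.x ->> 'field1', t.x ->> 'field2', t.x ->> 'field3'
  FROM (
    SELECT x
    FROM my_table
    OFFSET 0  -- optimization fence: stops the planner from inlining the subquery
  ) t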


This sounds like an interesting optimization opportunity. Do you know whether this will be patched?


The funny thing is that pg shouldn't need to deTOAST the value to be able to query it, as long as it's jsonb; it's a pretty straightforward iteration. We do the same thing when converting from jsonb to v8 in plv8.


How come? As far as I know, if the jsonb data was TOASTed (too big to fit in the roughly 2 kB inline limit), then in order to query anything back from inside the jsonb you have to deTOAST it, by definition.


How do you find out that the optimiser inlined the subquery? Is this visible in the explain output?


IIRC, EXPLAIN VERBOSE will show you the columns being selected by each step of the plan. The inlining showed up there.


Or better yet, EXPLAIN ANALYZE VERBOSE, which will give you a slightly fuller picture.
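
For example, reusing the placeholder query from upthread:

  EXPLAIN (ANALYZE, VERBOSE)
  SELECT x ->> 'field1', x ->> 'field2', x ->> 'field3'
  FROM my_table;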


The article only briefly touches on this, but you can control various aspects of how Postgres decides to store your value using `ALTER TABLE ... SET STORAGE`, including the ability to make it inline, uncompressed, compressed, or TOASTed, and also to change the threshold parameters for when Postgres decides it for you. The documentation is about a third of the way down this page:

https://www.postgresql.org/docs/13/sql-altertable.html

This is useful if you have, for example, short character codes coming from some external system. The default is EXTENDED (external, compressed), but to make the absolute smallest stored tuple possible, if you know the value is short or fixed-length, you can go MAIN (internal, compressed) or PLAIN (internal, uncompressed).
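
For example, against a hypothetical my_table with a short country_code column (the toast_tuple_target storage parameter requires PostgreSQL 11+):

  -- keep the value inline and compressed:
  ALTER TABLE my_table ALTER COLUMN country_code SET STORAGE MAIN;

  -- or inline and uncompressed:
  ALTER TABLE my_table ALTER COLUMN country_code SET STORAGE PLAIN;

  -- lower the row-size threshold at which Postgres starts TOASTing (PG 11+):
  ALTER TABLE my_table SET (toast_tuple_target = 256);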


This is why most databases have separate char/varchar and text types: char and varchar are stored inline, and text is stored externally.

PostgreSQL doesn't do this: there is no storage difference between any of the character types, they are all stored the same way, and the type only serves to validate the data. See: https://www.postgresql.org/docs/current/datatype-character.html

MySQL even has TINYTEXT for when you want to store small strings outside the table (for performance). With TINYTEXT only a single byte is stored in the row and the rest is stored externally, so I/O is reduced when doing a full table scan if you don't need to read the TINYTEXT column.


For MySQL with the default InnoDB engine, how many text bytes are stored in-row also depends on the 'row format' in use: "The InnoDB storage engine supports four row formats: REDUNDANT, COMPACT, DYNAMIC, and COMPRESSED."


For those wondering, “TOAST” actually stands for “The Oversized-Attribute Storage Technique”. See https://www.postgresql.org/docs/13/storage-toast.html. The link also provides some implementation details.


There can be a significant impact from off-page types (VARCHAR(MAX), NVARCHAR(MAX), VARBINARY(MAX), and the deprecated TEXT/NTEXT/IMAGE) in SQL Server, and presumably in other DBMSs too.

One key difference, though, is that SQL Server doesn't compress the off-page parts, whereas this article states Postgres does. In fact, even if you have the table set to compress using either the row or page compression option, off-page data is not compressed.


The way PG handles this is better than SQL Server did back when I used it primarily (2008 R2). The compression is not as amazing as it could be, but it is quite helpful regardless. It'll be nice when the work on "pluggable compression" [1] (or something similar) is committed, allowing the use of better compression algorithms like zstd.

[1] https://www.postgresql.eu/events/pgconfeu2019/sessions/sessi...


IIRC, 2008 R2 didn't support compression at all (I think it was added in 2012 for Enterprise/Developer editions, then for all other editions in 2016 SP1?), at least not without enabling CLR and writing your own modules to handle it.

The latest versions still don't compress off-page data at all (with the exception of using 2019+'s support for UTF-8 in place of the fixed two-bytes-per-character string types as a form of compression), though there are methods using triggers, backing tables, and the [DE]COMPRESS functions if you really need LOB compression and don't need things like full-text search.
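
A minimal sketch of the [DE]COMPRESS approach, assuming a hypothetical docs backing table (COMPRESS/DECOMPRESS are the SQL Server 2016+ gzip functions; the trigger plumbing is omitted):

  -- hypothetical backing table:
  CREATE TABLE docs (id int PRIMARY KEY, body_compressed varbinary(max));

  -- store the LOB gzip-compressed:
  INSERT INTO docs (id, body_compressed)
  VALUES (1, COMPRESS(N'...some large document...'));

  -- read it back:
  SELECT CAST(DECOMPRESS(body_compressed) AS nvarchar(max)) AS body
  FROM docs;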


One thing to keep in mind is that TOAST tables also require vacuuming. Like any other table, they count towards the TXID limits. Make sure to tune the DB accordingly if you have many of them and the DB is busy with writes.
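
One way to find a table's TOAST table so you can keep an eye on it (my_table is a placeholder):

  SELECT reltoastrelid::regclass AS toast_table
  FROM pg_class
  WHERE relname = 'my_table';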


Very interesting! I've spent many moons optimizing PostgreSQL configs and queries, and never ran across this before :)

Pretty sure I even have some targets to try this on.



