So I'm coming at this from a Postgres POV, IIRC MySQL can take some shortcuts fo...

evanelias · on May 29, 2022

It's slow and expensive in MySQL too, assuming you mean InnoDB storage engine. Recent versions of MySQL can parallelize a COUNT(*), but that only helps so much.

The "shortcuts" you're thinking of may be from the MyISAM storage engine days. Since MyISAM doesn't have transactions or MVCC, it can maintain a row count for the table and just return that. But MyISAM is generally not used for the past decade (or more), as it just isn't really suitable for storing data that you care about.

Storage engines with transactions cannot just store a row count per table because the accurate count always depends on your transaction's isolated snapshot.

nolevels · on May 28, 2022

Not necessarily in postgres. If an index is available it may use that, though it will still have to refer to the visibility map to see if any row is available to the current transaction (cheap check), and if it can't determine from that, then and only then does it need to also refer to the heap / actual table (slower check).

inferiorhuman · on May 28, 2022

Is that a recent change? Back in 2016 the Citus guys claimed that:

  There is no single universal row count that the database could cache, so it must
  scan through all rows counting how many are visible. Performance for an exact
  count grows linearly with table size.

https://www.citusdata.com/blog/2016/10/12/count-performance/

nolevels · on May 28, 2022

Apologies, I should've clarified. I just meant postgres doesn't necessarily have to scan the actual table to get the count in some cases if an index is available, not that it caches the actual count.

inferiorhuman · on May 28, 2022

So that's not the greatest quote but the article shows postgres having to fall back on a sequential scan for a naive COUNT(*). The author goes on to mention that as of 9.2 Postgres can do an index scan (in the context of SELECT DISTINCT):

  But now we come to a quirk, SELECT COUNT(DISTINCT n) FROM items will not use
  the index even though SELECT DISTINCT n does. As many blog posts mention (“one
  weird trick to make postgres 50x faster!”) you can guide the planner by rewriting
  count distinct as the count of a subquery

To me it seems like you can trick the query planner into doing an index scan sometimes, but that it's a bit brittle. Has this improved much since?

nolevels · on May 28, 2022

I think the wiki explains it well

https://wiki.postgresql.org/wiki/Index-only_scans#Is_.22coun...

It says that index only scans can be used with predicates and less often without predicates, though in my experience I've seen the index often used even without predicates.

inferiorhuman · on May 29, 2022

Relevant quote:

  Index-only scans are opportunistic, in that they take advantage of a pre-existing
  state of affairs where it happens to be possible to elide heap access. However,
  the server doesn't make any particular effort to facilitate index-only scans,
  and it is difficult to recommend a course of action to make index-only scans
  occur more frequently, except to define covering indexes in response to a
  measured need

And:

  Index-only scans are only used when the planner surmises that that will reduce the
  total amount of I/O required, according to its imperfect cost-based modelling. This all
  heavily depends on visibility of tuples, if an index would be used anyway (i.e. how
  selective a predicate is, etc), and if there is actually an index available that could
  be used by an index-only scan in principle.

So yeah, I don't know if your use case is exceptionally lucky or if the documentation is just pessimistic (or both). Good to know you can coerce the query planner into doing an index scan though.