
How do you determine on the backend whether a page is the last one? The trick I used to avoid extra SQL queries: if a user asks for 10 records, I load up to 11 and then return 10. If there were 11, I return a cursor for the next page; otherwise the next page is null. Is there a better way?
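
Concretely, with a keyset cursor it looks something like this (assuming an `items` table with an integer primary key; the page size and column names are just for illustration):

    -- ask for one more row than the page size
    SELECT id, title
      FROM items
     WHERE id > :cursor     -- last id of the previous page, 0 for the first page
     ORDER BY id
     LIMIT 11;              -- page size 10, plus 1
    -- if 11 rows come back, drop the 11th and use the 10th row's id as the next
    -- cursor; if 10 or fewer come back, this is the last page and the cursor is null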



Moreover, you can query, like, +3 more, and if the extra is fewer than 3, return those extra items (the short tail) as well. This way you won't end up with a last page of only 1-2 items.


How do you deal with cases where results must be ordered by some field and the data changes between page 1 request and page 2 request?


If the ordering changed because the data changed, then just show the new situation on that page. If the results' order is important, there are several things you can do (at least):

1. You could save the order under a token and iterate with respect to that saved order, only showing the data for the ids that were originally included.

2. Instead of continuing from a page count, continue from the last value seen (keyset-style; see the sketch below). Any mutation before that point will be invisible without having to store the original result list (only ids).
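
A rough sketch of (2), assuming the results are ordered by a `created` column (names are made up; Postgres can use a composite index on (created, id) for the row-value comparison):

    -- resume from the last (created, id) the client saw instead of an OFFSET,
    -- so rows added or removed earlier in the ordering don't shift the window
    SELECT id, title, created
      FROM items
     WHERE (created, id) > (:last_created, :last_id)
     ORDER BY created, id
     LIMIT 10;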


You could select into a temporary table and return pages of results from that.

You'd need some mechanism to delete the temporary table after a time, of course, but I imagine this is not uncommon.
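
Roughly like this (the table name and search predicate are invented; a regular table keyed by a search token is often easier to clean up than a true TEMPORARY table, which only lives for the session):

    -- materialize the result set once...
    CREATE TABLE search_results_42 AS
      SELECT id, name, created
        FROM items
       WHERE name LIKE '%widget%';

    -- ...then page from the frozen copy on subsequent requests
    SELECT * FROM search_results_42
     ORDER BY created
     LIMIT 10 OFFSET 20;

    -- and drop it after some expiry
    DROP TABLE search_results_42;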

Another method would be having a last-modified timestamp on every row in the DB; then you could select records at or before the initial query time. Seems like overkill just for this one purpose, but I imagine it might be generally useful information to have in the DB.
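
Something like this (column names invented; :initial_query_time is the timestamp captured when page 1 was served):

    -- only rows that already existed, unmodified, at the time of the first request
    SELECT id, name
      FROM items
     WHERE last_modified <= :initial_query_time
     ORDER BY name
     LIMIT 10 OFFSET :offset;

One trade-off: rows modified after the initial query simply drop out of later pages rather than appearing at their old position.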


I mean, there's no great way; the list of tools you might deploy is very long, though:

(1) Don't solve the problem, trust the user to figure it out. This is the default (update rows and if the data updates underneath someone then tough) and it works in 95%? of cases or so. Works poorly when computers consume your results.

(2) Versioning tables! Forget updates let's just insert all the things!

    SELECT * FROM projectVersions
      NATURAL JOIN (SELECT projectKey, MAX(created) AS created 
            FROM projectVersions 
            WHERE created <= :time GROUP BY projectKey) AS latest
(or join ON or USING..., use an incrementing version, etc. You need a UNIQUE index on (projectKey, created) which happens to also make this quite performant given the DB sizes involved...)
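
A possible shape for that table (columns beyond projectKey/created are invented):

    CREATE TABLE projectVersions (
      projectKey  bigint       NOT NULL,
      created     timestamptz  NOT NULL DEFAULT now(),
      name        text         NOT NULL,
      -- ...every other project column, re-inserted as a whole new row on each "update"...
      UNIQUE (projectKey, created)
    );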

Deletion is somewhat finicky here: soft-deletion works just fine as-is, but hard deletion brings back the issues from (1) again. The bigger problem is that if a "project" is a big record then you start copying all this data and your DB grows huge.

(3) Snapshots + Diffs! The previous storage concern can be alleviated by only updating the whole record (a new "snapshot") after 10-20 diffs are recorded, otherwise just store the diff. It's O(1) if you ensure a constant max of diffs... If referential integrity via foreign key constraints is really important to you, the most extreme version of this is a table structure,

                          snapshotId
    [ projectSnapshots ] <----------- [ projectPrimaryContactDiffs ]
           \                                 |
            \                                | contactId
             \   primaryContactId            V 
              -------------------------> [ contacts ]
where each foreign key column has to become its own diff table to enforce the constraint! Then it's absolutely important for the snapshot table to have a metacolumn counting diffs, hah. But if you are not this picky then a single table projectSnapshotDiffs can work.
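
A sketch of that diagram as DDL (all names invented):

    CREATE TABLE contacts (
      contactId  bigserial PRIMARY KEY,
      name       text NOT NULL
    );

    CREATE TABLE projectSnapshots (
      snapshotId        bigserial PRIMARY KEY,
      primaryContactId  bigint REFERENCES contacts (contactId),
      diffCount         int NOT NULL DEFAULT 0,   -- the metacolumn counting diffs
      created           timestamptz NOT NULL DEFAULT now()
    );

    -- one diff table per versioned foreign-key column, so the constraint still holds
    CREATE TABLE projectPrimaryContactDiffs (
      snapshotId  bigint NOT NULL REFERENCES projectSnapshots (snapshotId),
      contactId   bigint NOT NULL REFERENCES contacts (contactId),
      created     timestamptz NOT NULL DEFAULT now()
    );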

Other variations of this have extra diff columns sitting on `projectSnapshots` allowing you to change things, possibly just one column which is JSON, etc. ... there are tradeoffs on how available you need the versioned data to be to your database itself for DB queries.

(4) Graph databases. The most extreme form of (3), where we abandon snapshots: maybe every row in some table represents a diff for some column at some time! This ultimately creates a graph database; sentences or "facts" ("Horn clauses") are recorded: "at time T, the [subject] [verb]ed such-and-so [object]," i.e. you store "record/fieldName/fieldValue" tuples. Sometimes you can also include an integer index to +1 assert or -1 retract facts; queries then sum over the facts, and when you find values that are not 0 or 1 you can guess that a _merge conflict_ happened.
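
In table form that might look like (names invented):

    CREATE TABLE facts (
      record      bigint       NOT NULL,  -- the subject
      fieldName   text         NOT NULL,  -- the attribute
      fieldValue  text         NOT NULL,  -- the object
      op          smallint     NOT NULL,  -- +1 assert, -1 retract
      assertedAt  timestamptz  NOT NULL DEFAULT now()
    );

    -- the value(s) of a field as of :time; a sum other than 0 or 1 smells like
    -- a merge conflict
    SELECT fieldValue, SUM(op) AS n
      FROM facts
     WHERE record = :id
       AND fieldName = 'primaryContact'
       AND assertedAt <= :time
     GROUP BY fieldValue
    HAVING SUM(op) <> 0;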

(5) Give up and do functional programming :) What this means is, if all of your records are immutable then you get time persistence for free, and you often don't need to copy data as in (2) because you can use structural sharing. The flip side is that everything needs to be accessed by _pointers_ rather than _indexes_, or else whatever is indexed needs to be copied, which can get expensive.

It's worth giving a story for why you would do this. You say up-front, "I want the ability to version a list of Resources for this Project, just like the other fields in my Project Snapshot. When you look at the Project at version 53, you should see the resources that were in that project."

Well, you start with Resources that point to a Project Snapshot or so, but whenever you generate a new Project Snapshot you find yourself copying a bunch of Resources, even when the resources haven't been updated. You maybe insert a ProjectResources many-to-many table, but you're still inserting a bunch into that many-to-many table even when the resources haven't been updated.

The FP instinct comes in when you say "I will have Projects have a reference to a ResourceList and store the reference to that ResourceList, so that I don't have to do this copying when I don't update the resources." Now the arrow has been (somewhat) inverted, Projects point at Resources (somewhat) rather than Resources pointing at Projects. Problem solved, sort of:

    [ projectSnapshot ] ----> [ resourceList ] <--- [ resourceListItem ]
                                                         /
                                [ resource ] <-----------
The problem still comes back to bite you later when it turns out one of your biggest power-user clients has 50,000 Resources which they are constantly updating, and now each time they update a Resource they trigger the creation of a new resourceList and hence the copy of 50,000 ResourceListItem records.

The story is that working backwards you say "ok I'm going to bound their update so each of their updates only inserts, like, 50 ResourceListItem records max," so now you have some notion of a ResourceListPages table, the resource belongs to a page of 50 or so resources in the ResourceList... except now you're still inserting 1,000 pages when you reconstruct every ResourceList. It got way better but it's still not scaling. In desperation you reach for a recursive structure and write out this comment in some source code, haha:

    /**
     * A ResourceList of order 0 is a page of up to 50 Resources which
     * foreign-key to it, for balancing reasons we store here a "count" of how
     * many Resource records point to it.
     *
     * A ResourceList of order N > 0 is a collection of pointers to up to 8 
     * ResourceLists of order N-1, called subList1, subList2, ... subList8, 
     * and the count is the sum of counts of all of the sublists.
     */
You'll have to be in a rather unique case before you run into this, haha, but when you do, you'll be glad if your database supports recursive queries with the WITH statement in SQL. And then SQL will be happy to fetch all of the ResourceLists of order 0, aggregate all of the Resources that point at them, and order them into reliable pages however you like.
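
For illustration, assuming for simplicity that each sublist carries a parentListId back-pointer rather than the eight subListN columns (all names invented), the recursive fetch looks something like:

    WITH RECURSIVE lists AS (
        SELECT id, ord FROM resourceLists WHERE id = :rootListId
        UNION ALL
        SELECT child.id, child.ord
          FROM resourceLists child
          JOIN lists parent ON child.parentListId = parent.id
    )
    -- gather the resources hanging off the order-0 pages and page over them
    SELECT r.*
      FROM resources r
      JOIN lists l ON r.resourceListId = l.id
     WHERE l.ord = 0
     ORDER BY r.id
     LIMIT 50 OFFSET :offset;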


You could ask for a count of records and divide by the page size.


Counts become incredibly slow on some databases like Postgres. It needs to do a sequential scan, which results in slow(er) performance on even tens of thousands of rows. When operating at scale, this is very, very slow.


From what I understand it's not slow if it can use an index, but the criteria for that are rather hard for a normal user to understand. If you vacuum often enough and the visibility map is current, it doesn't need a full table scan.


Isn't count slow on most databases? It either ends up in a table scan or an index scan, and if the index is huge, that will also take a lot of time.

I wonder how the big companies solve this counting problem.


This seems unlikely to depend on which database software. It could easily depend on what indexes exist for the query.


This is what I do, since counts should cause minimal overhead (I've never had to deal with huge databases). But the N+1 records trick which OP describes doesn't add overhead either; that's clever. :)


Counts can cause significant overhead on even millions of rows.


If I were to ask for the number of even numbers greater than zero and less than 10 billion, you could very quickly give an exact answer. If I were to ask for the exact number of prime numbers in that same range, it would take much longer to find the correct answer.

In the same way, asking for counts may or may not be a quick answer, depending on both the table structure and the query being performed. Query the number of customers who started this year, on a table indexed by start date, and it's incredibly cheap. If that same table is only indexed by last name, then counts become more expensive. I'd say that the advice to return the counts has an implicit addendum to make sure that your database is appropriately indexed such that your typical queries are cheap to perform.
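
For example (names invented), with an index on the start date the count can be satisfied from the index range rather than a full table scan:

    CREATE INDEX customers_start_date_idx ON customers (start_date);

    SELECT COUNT(*)
      FROM customers
     WHERE start_date >= date_trunc('year', now());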


So I'm coming at this from a Postgres POV; IIRC MySQL can take some shortcuts for COUNT(*). Constraining the query will help, but counting is expensive as a sequential table scan is always required.

Meanwhile I can only think of one use case for pagination with exact row counts – a user-facing application. If you're preallocating storage you can get by with an estimate. If you're consuming an API and want something from near the end you can flip the sort order.
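
For the estimate case, Postgres keeps a planner estimate in the catalog that is usually good enough for a page count (assuming a table named items):

    -- maintained by autovacuum/ANALYZE; no table scan needed
    SELECT reltuples::bigint AS estimated_rows
      FROM pg_class
     WHERE relname = 'items';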


It's slow and expensive in MySQL too, assuming you mean InnoDB storage engine. Recent versions of MySQL can parallelize a COUNT(*), but that only helps so much.

The "shortcuts" you're thinking of may be from the MyISAM storage engine days. Since MyISAM doesn't have transactions or MVCC, it can maintain a row count for the table and just return that. But MyISAM is generally not used for the past decade (or more), as it just isn't really suitable for storing data that you care about.

Storage engines with transactions cannot just store a row count per table because the accurate count always depends on your transaction's isolated snapshot.


Not necessarily in postgres. If an index is available it may use that, though it will still have to consult the visibility map to see whether the rows are visible to the current transaction (cheap check), and only if it can't determine that from the map does it also need to refer to the heap / actual table (slower check).


Is that a recent change? Back in 2016 the Citus guys claimed that:

  There is no single universal row count that the database could cache, so it must
  scan through all rows counting how many are visible. Performance for an exact
  count grows linearly with table size.
https://www.citusdata.com/blog/2016/10/12/count-performance/


Apologies, I should've clarified. I just meant postgres doesn't necessarily have to scan the actual table to get the count in some cases if an index is available, not that it caches the actual count.


So that's not the greatest quote but the article shows postgres having to fall back on a sequential scan for a naive COUNT(*). The author goes on to mention that as of 9.2 Postgres can do an index scan (in the context of SELECT DISTINCT):

  But now we come to a quirk, SELECT COUNT(DISTINCT n) FROM items will not use
  the index even though SELECT DISTINCT n does. As many blog posts mention (“one
  weird trick to make postgres 50x faster!”) you can guide the planner by rewriting
  count distinct as the count of a subquery
To me it seems like you can trick the query planner into doing an index scan sometimes, but that it's a bit brittle. Has this improved much since?
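
For reference, the rewrite that quote refers to is just counting the rows of a DISTINCT subquery, so the planner can satisfy it from the index on n (table/column names from the article):

    SELECT COUNT(*)
      FROM (SELECT DISTINCT n FROM items) AS t;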


I think the wiki explains it well

https://wiki.postgresql.org/wiki/Index-only_scans#Is_.22coun...

It says that index-only scans can be used with predicates and less often without predicates, though in my experience I've seen the index often used even without predicates.


Relevant quote:

  Index-only scans are opportunistic, in that they take advantage of a pre-existing
  state of affairs where it happens to be possible to elide heap access. However,
  the server doesn't make any particular effort to facilitate index-only scans,
  and it is difficult to recommend a course of action to make index-only scans
  occur more frequently, except to define covering indexes in response to a
  measured need
And:

  Index-only scans are only used when the planner surmises that that will reduce the
  total amount of I/O required, according to its imperfect cost-based modelling. This all
  heavily depends on visibility of tuples, if an index would be used anyway (i.e. how
  selective a predicate is, etc), and if there is actually an index available that could
  be used by an index-only scan in principle.
So yeah, I don't know if your use case is exceptionally lucky or if the documentation is just pessimistic (or both). Good to know you can coerce the query planner into doing an index scan though.


Thanks, that's good to know (I'll have to run some tests at some point). I do have a few tables (namely logs) that will eventually have millions of rows. But I've also been planning on rotating those whenever that becomes a problem, so I'm not overly concerned. Or I'll experiment with OP's solution.


Sure, but at that point you've really got to start thinking about architecture.

Do you really need to paginate millions of rows and count them? Can you distribute the database and parallelize counting? Can you save counts somewhere else? Can you use estimates?


Query limit+1 and then check for the presence of the extra (limit+1) item. Never use `count`: counts are usually slow, and you might need to add more filters in the future. I will add this to the article.


That's what we do; I'm not sure there is another way, other than making a request for the next page.


Just let people query for an empty page every now and then, no big deal.



