Paginating Requests in APIs (2020) (ignaciochiazzo.medium.com)
259 points by exemel_hn on May 28, 2022 | 185 comments



You should never design an API that uses offsets for pagination unless you're dealing with small amounts of data (<10000 rows). Cursors give you far more flexibility and if you want to be lazy, you can just hide an offset in an opaque cursor blob and upgrade later on.

I don't think I've used offsets in APIs for at least 10 years. Lightly-obfuscated cursor tokens are one of the first things I build in web projects and that's usually less than an hour's work.

If you _really_ need the ability to drop the needle in your dataset with pagination, design your system to use pseudo-pagination where you approximate page-to-record mappings and generate cursors to continue forward or backward from that point.
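
A minimal sketch of the "hide an offset in an opaque cursor blob" idea (the field names and plain base64 are my own choices; in practice you'd probably sign or otherwise obfuscate the token):

    import base64, json

    def encode_cursor(state: dict) -> str:
        # Opaque to clients: today it only wraps an offset, later it can carry
        # a keyset value instead without changing the API contract.
        return base64.urlsafe_b64encode(json.dumps(state).encode()).decode()

    def decode_cursor(token: str) -> dict:
        return json.loads(base64.urlsafe_b64decode(token.encode()))

    next_token = encode_cursor({"v": 1, "offset": 50})
    print(next_token, decode_cursor(next_token))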


IIRC the DigitalOcean API uses offsets. We had to write special code to handle the possibility of the list changing while it was being queried.


When I worked for a news site, we moved all listing endpoints from offset to timestamp based pagination because offset based pagination assumes the data being queried will remain stable between requests. This was, of course, very rarely true during business hours.

Duplicates or missed entries in the result set are the most likely outcome of offset-based pagination.


Do any of these avoid that scenario? Unless you are paginating on `updated_at ASC` there is an edge case of missing new data on previously requested pages.


Missing data that began existing after you started querying is usually OK. If you had requested the data 1 second earlier, it wouldn't have been returned anyway. With offset pagination, you can miss data that always existed. This can make you believe the item that previously existed had been deleted.

Be very careful with timestamp or auto-increment id pagination too. These don't necessarily become visible in the same order since the id or timestamp is generated before the transaction commits unless your database has some specific way of ensuring otherwise.


What do you use then that has the same order as rows becoming visible?

We use an auto-increment id, and lock inserts on the related account (which always limits the scope of the query).

The only other (stateless) way I can think of is to somehow fiddle with transaction numbers linked to commit order.


Sorry for replying very late. I've used a similar technique of locking a "parent" row while adding items to a corresponding set. It works great as long as you can figure out a relationship like that and it's fine-grained enough that you are OK with the locking.

In traditional databases, the number linked to the commit order is usually the LSN (Log Sequence Number), which is an offset into the transaction log. Unfortunately, you can't figure that out until your transaction commits, so you can't use it during the transaction.

A hypothetical database where you could see your own LSN from within a transaction would require transaction commit order to be pre-determined at that point. An unrelated transaction with a lower LSN would block your transaction from committing.

In non-traditional databases, this could work differently. E.g. in kafka you can see your partition offsets during a transaction and messages in that partition will become visible in offset-order. The tradeoff is that this order doesn't correspond to global transaction commit order and readers will block waiting for uncommitted transactions (and all the other things about kafka too).


Exactly.


Something always feels off about repopulating lists that people are working on. Like that concept needs an entirely different display method.


It’s what happens with stateless protocols and offset based pagination 100% of the time, but most people don’t notice it.


how do you jump to arbitrary pages with cursor pagination?


You don't, but if your cursor is just a position in some well-defined order, you can sometimes craft a cursor out of a known position. Continuing the book metaphor, instead of skipping to page 100, which has no semantic meaning, you could skip to the beginning of the "M" section of the dictionary, or see the first 10 words after "merchant".


Forum posts are the classic pagination example (in my mind), but I feel like the type of cursor pagination you’re talking about might work better.

Skipping forward by date, by username, or even by concept might make long threads much quicker to scan through and understand.


You generally don’t, unless you want to linearly scan through n pages. If your API is using offset or page numbers and the underlying collection is also having items added or removed, the behavior is undefined or straight up wrong. I think it’s okay to use offsets or page numbers in cases where the collection is static or where it’s acceptable to occasionally skip over an item or see a duplicate item while paginating.


Or where you want people to be able to send each other links of the ‘look on page 11’ kind.


Going with the point re: cursor links, if your URLs look something like this then they would be shareable and more stable than page=11.

  ?startWith=<item_id>&sortBy=<alpha|datetimedesc|whatever>


Just impossible to navigate to by voice. In the office here people often call out the number of the page, but nobody will call out cursor AHgeuusn5d


Depending on what you mean by ‘page’, you probably want a query? (Show me all results with values between 1 and 10)

If you want to jump ahead in the results, that is fundamentally unstable though.


I see three options if this is a necessary use-case against a shared-changeable dataset:

1. Accept that the results will be unstable as the underlying set changes. Pagination may either miss or double-include items unpredictably.

2. Store an intermediate result set guaranteed to remain stable for the necessary duration. This will provide stable pagination, at the cost of solving cache-expiry problems.

3. Use or build a version-controlled data store. I don't know of anything in common use, but there is likely something available. This is similar to #2, but moves the work from the application into the data-storage layer. You then paginate against a set version of the data. Imagine something similar to Immutable, but with expiry for unreachable nodes.

Unsure what #3 looks like at scale.


Or... search, not page, which is typically way more useful in reality.


I consider offset-based paged pagination to be a mark of the novice.


Or the mark of someone who asked the users what they wanted.


I don't disagree, but everyone would prefer an API that works over one that doesn't, or one that causes service outages due to database latency. (Did anyone ask the users if they wanted the API to work?)

If you're dealing with very small datasets, it's fine. I'm an average person using average APIs, which means that when I see offset-based pagination, it's usually on a service deployed and used by a lot of people.

Unsurprisingly, the offset-based APIs often include some other arbitrary limit like "offset limited to 10k" or something silly but understandable if you've built an API used by thousands of people before, or understand how databases work.

They're also often superseded by better APIs that actually allow you to page the entire result set. Then you have a deprecated API that you either support forever or annoy users by turning it off.

So yes, if you are building something non-internal/pet project, limit/offset is probably the mark of the novice.


edit: I just saw another comment of yours, so I see this was meant more contextually than I realized.

Can you explain why offsets would never be a suitable solution? Is there a clear explanation as to why?

I understand how cursors are superior in some cases. But what if you have a site which paginates static data? Offsets would allow you to cache the results easily, overcoming any performance concerns (which would be irrelevant if the dataset was small anyway), providing a better user experience (and developer experience due to simpler implementation details) overall.

I can see that it would be a novice move if you’ve got staggering amounts of records that take a while to query. But that’s actually pretty rare in my experience.


It doesn't even have to be a cursor; you can technically page off of any field you can index (I've written a lot of sync APIs so there are gotchas, but the point stands).

Limit/offset is usually (though not always, as you point out) egregious for anything more than hobbies or small projects. But, I am biased as when I build APIs, I definitely expect some amount of traffic/volume/tuple count, and offset/limit will not do.


Cool, I think we mostly agree. Thanks for the explanation!


I'd also point out that the GraphQL/Relay spec for pagination enforces cursor based pagination: https://relay.dev/graphql/connections.htm. It's also a really nice pagination spec overall, edges/node/pageInfo are well thought out and flexible.

In my professional experience, lots of devs will implement fetching APIs and not consider pagination, and then when the responses start growing (testing data, use, whatever), user experience sucks. Pagination should be a default API design for anything returning an unbounded list of items.
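
For reference, a connection page under that spec looks roughly like this (the values are made up, but edges/node/pageInfo and the hasNextPage/endCursor fields are what the spec defines):

    page = {
        "edges": [
            {"cursor": "cursor-1", "node": {"id": "1", "name": "Ada"}},
            {"cursor": "cursor-2", "node": {"id": "2", "name": "Grace"}},
        ],
        "pageInfo": {
            "hasNextPage": True,
            "hasPreviousPage": False,
            "startCursor": "cursor-1",
            "endCursor": "cursor-2",
        },
    }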


How do you deal with a GraphQL query that hits multiple nodes, each of which may return its own cursor? Seems like a nightmare if you have to check and retry manually for each node; maybe there are libraries that take care of it transparently?


How is this different than hitting multiple REST endpoints for different resources with their own individual pagination? If the client needs to glue together paginated resources from two places, it's a hairy problem no matter what the protocol. A backend for frontend that does this under the hood is one solution to both, or a specialized endpoint that does it under the hood.


In the case of hitting multiple REST endpoints for different resources, on each query it's clear what pagination I'm using and combining the data is just appending it to the final list. In each case the partial results are flat and linear.

But in the case of GraphQL, like I may get an incomplete list, or I may get a complete list that will have some fields with missing data, or some combination of both, going down deep depending on the structure of the query. I don't see a clear way to request the missing data without duplicating part of what I already have, and I don't see a clear way to combine the results.

But I may be missing something about GraphQL or something about the tools that makes this simple.

For example something like this:

    {
      users(first: 10000, after: "cursor123") {
        edges {
          cursor
          node {
            id
            name
            friends(first: 10000, after: "cursor235") {
              edges {
                cursor
                node {
                  id
                  name
                }
              }
            }
          }
        }
      }
    }


Oh. I think you'd handle it the same as REST: make a new individual field query for that friend by ID and ask for the pagination there. I think that would be manual work for GraphQL or REST; I'm not aware of anything that does that automatically. It's also not a case I've personally run into, so not sure.


In the Relay JS library you can avoid some duplication with fragments. Fragments are pieces of a GraphQL query that you can include in multiple queries. If you are using Flow, the type system will also ensure that you can't render a component unless you have used the correct fragment in your query.

That said, paginating multiple resources on the same page is still going to be complicated.


Agreed, so many times I've raised paging with an internal API provider only to be told they can retrofit it later if needed. Then you start to use production data and bam the one call doesn't work.


I disagree. I think pagination is a clumsy error prone kludge. That it performs better than the awful naive approach by additionally burdening the caller doesn’t remotely make it good. The correct approach is a buffered stream with proper flow control. Incidentally that’s what Transmission Control Protocol was invented for.

Probably the nicest way to expose this that fits contemporary sensibilities would be as a lazy sequence that does sensible prefetching and asks the server for more data as it’s iterated through.


> a lazy sequence that does sensible prefetching and asks the server for more data as it’s iterated through

Isn't this an accurate description of cursor-based pagination? You can have the server hold the cursor and associate a request for the next page based on the client ID (like in the PostgreSQL wire protocol), or you can make the flow stateless (at the network level) by handing the client a ticket for the next page.


Pagination implies many requests, whereas a streaming API would keep to a single request, no? No worry about multiple servers or juggling state.


It's still multiple requests on a streaming API if you're asking the client to notify you when it's ready for the next page. Most DB engines use TCP connection affinity to identify which cursor is in play, but the client still needs to send a packet asking for the next page. Both client and server need to keep the TCP connection open, which is state for both parties.

I see you mentioned gRPC streams in a sibling comment, which are a great alternative to REST-based pagination, but they are highly stateful! Streams are built into the protocol, so a generic gRPC client will handle most of the state juggling that would have been your responsibility with a REST client.


This is a bit inaccurate, I think. You can make a RESTful API over gRPC; the protocol is irrelevant. An open connection is not a violation of statelessness. If anything, a streaming API where you can close the cursor upon completion and treat new requests as fully new is a lot less stateful.

We're also talking about APIs in general, not just http REST.

>It's still multiple requests on a streaming API

Perhaps, but the requests can be pipelined. You don't need to wait for the response to complete before asking for more.


The original article is talking about pagination over stateless (at the protocol level) HTTP requests. (Referring to APIs that are offered over stateless HTTP as "REST APIs" is technically incorrect but reflects common usage and is a convenient shorthand.)

gRPC is able to offer powerful streaming abstractions because it utilizes HTTP/2 streams, which are cursor based rather than strictly connection oriented. The state is still there; it's just a protocol-level abstraction rather than an application-level abstraction.

> Perhaps, but the requests can be pipelined. You don't need to wait for the response to complete before asking for more.

That sort of defeats the purpose of grpc flow control, doesn't it?


>That sort of defeats the purpose of grpc flow control, doesn't it?

Why do you say that? I don't know the full implementation details themselves but generally there's no reason you can't safely ask for even more after having asked and consumed some. If you have a buffer of M bytes and ask for M, then consume N, you could immediately ask for N more without waiting to receive all of the original M.

Although, I wasn't speaking about gRPC in that case though. I'm not sure how exactly gRPC achieves back pressure. I was only speaking abstractly about a pipelined call vs many pages. You seemed to claim that multiple requests was a requirement.

But fine, ignoring gRPC, you could possibly tune your stack such that normal http calls achieve back pressure from network stacks and packet loss. Http is built on top of TCP streams, after all. That doesn't make it inherently stateful does it?

Going all the way back, I still think it's fair to say that a streaming response with backpressure can be stateless and without multiple requests. If you want to argue that multiple packets or TCP signals are needed then perhaps so, but I think that's a far cry from the many separate requests a paginated call requires and I don't think it's accurate to conflate them.


> I don't know the full implementation details themselves but generally there's no reason you can't safely ask for even more after having asked and consumed some

I think we're saying the same thing but using different formulations. If you send an HTTP request for a list with 20 items, then get back a response with 10 items and a link for the next page, that is essentially the same as consuming 20 items over a stream with flow control. The point of returning a cursor in the response and having the client send a new request for the next page is to support a stateful stream over a stateless protocol. In neither case are you waiting for the response to be complete before processing items, since your message indicates that "complete" here means the full result set has been sent to the client.

> But fine, ignoring gRPC, you could possibly tune your stack such that normal http calls achieve back pressure from network stacks and packet loss. Http is built on top of TCP streams, after all. That doesn't make it inherently stateful does it?

That's pretty much how streaming responses are implemented in TCP-based protocols (like the SQL query APIs exposed by Postgres or MySQL). TCP connections can be terminated for unrelated reasons, which is why you don't see this pattern very often in recent protocols. When a TCP connection is dropped and restarted due to, say, network congestion, you have to restart the paginated operation from the head of the list. H2 (and, vicariously, gRPC) streams are built to be more resilient to network noise.

But to answer your question, yes, that pattern is inherently stateful. Pagination has to be, since the server has to know what the client has seen in order to prepare the next batch of results. You can manage this state at the protocol level (with streams), or you can push it to the application level (with cursor tokens embedded in the messages exchanged). The streaming approach requires a stateful server and client, whereas the application-level approach only requires a stateful client.


I think my opinion on close enough also differs from yours (and that's ok!) but you're also forgetting that a streamed/connection based approach only involves a single client and server and state during the connection.

A paged response needs to sync state across many machines. As you have said, you need a clientID, a cursor reference, or a sticky session. That's not nothing.

I think they're inherently different approaches.


> A paged response needs to sync state across many machines. As you have said, you need a clientID, a cursor reference, or a sticky session. That's not nothing.

OK, I think this is what I was misunderstanding in your comments above. You can keep pagination state in a distributed store, but it's not necessary. The implementations I've seen and worked with have all embedded all the information needed to generate the next page of results in the cursor token itself. Kind of like with a JWT or TLS session ticket, there's no need to sync pagination state or use sticky sessions. That approach also provides some resilience in case the server that generated the previous page of results becomes unreachable (something that the client needs to handle by restarting pagination if pagination state is just held in memory by the server keeping the stream connection open).


>Probably the nicest way to expose this that fits contemporary sensibilities would be as a lazy sequence that does sensible prefetching and asks the server for more data as it’s iterated through.

gRPC's streaming API is a good example. It has backpressure built in and you just need to configure the buffer sizes. Clients just consume items off the stream and async writes will suspend as needed.


When dealing with the most common case (I think) for pagination, clients (web/native) requesting data, are you suggesting keep a long running websocket only to fetch data in response to new page requests? What benefit would that afford, given that the requests are pretty far apart in time?

Or are you focusing more on the backend machine to machine case? In that case, some kind of automatic prefetching, lazy loading/yield setup sounds pretty nice if that's abstracted away when you want to iterate over a large dataset from a separate resource server. It's not a pattern I've used before, mainly because it's not often that one server wants to know everything that another server has.


I find pagination really annoying on the client, but the majority of the time it does seem like the client doesn't actually need the whole thing so it's a good (and often very necessary) performance optimization. For endpoints where that isn't the case, it's (usually) not too hard to allow a sentinel value for page size that means infinity, so the client really can get it all in one swoop.


What are you saying that would be different in practice? What advantages would it have?


Not sure I agree with the parent but thinking it through, you have a single connection and a cleaner life cycle. Streaming writes and reads could mean back-pressure is possible. A client can consume items as needed as opposed to mapping usage to a page size.

There's no worry about wasting a full request on the last page with a single item. In fact there's no worry about round trips at all, it's just a matter of buffer back pressure that communicates the need for more or less.

Hmm, when you think about it, it's a pretty good way to go.


That’s consistent with what I have in mind. I too think it’s pretty much best of both worlds. You get better efficiency than pagination with the convenience of an API that’s no different from fetching all the data.


I worked with an API like this once, which streamed NDJSON one line at a time. I loved the amount of control that it gave me on the client side!


How do you determine on the backend whether a page is the last one? The trick I used to avoid extra SQL queries: if a user asks for 10 records, I load up to 11 and then return 10. If there were 11, I return a cursor for the next page; otherwise the next page is null. Is there a better way?


Moreover, you can query, like, +3 more, and if the extra is less than 3, return those extra items (short tail) also. This way, you won't have a last page with only 1-2 items.
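
A rough sketch of the combined trick, assuming a sqlite-style connection and made-up table/column names:

    def fetch_page(conn, after_id, page_size=10, tail=3):
        # Ask for page_size + tail rows. If we get fewer than that, the whole
        # remainder fits into this page (short tail) and there is no next page;
        # otherwise return page_size rows plus a cursor.
        rows = conn.execute(
            "SELECT id, title FROM articles WHERE id > ? ORDER BY id LIMIT ?",
            (after_id, page_size + tail),
        ).fetchall()
        if len(rows) < page_size + tail:
            return rows, None                 # last page, tail folded in
        page = rows[:page_size]
        return page, page[-1][0]              # next cursor = last id returned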


How do you deal with cases where results must be ordered by some field and the data changes between page 1 request and page 2 request?


If the ordering changed because the data changed then just show the new situation at that page. If the results’ order is important there are several things you can do (at least):

1. You could save the order under a token and iterate with respect to that saved order, only showing the data for the ids that were originally included.

2. Instead of continuing from a page count, continue from the last seen value. Any mutation before that point will be invisible without having to store the original result list (only ids).


You could select into a temporary table and return pages of results from that.

You'd need some mechanism to delete the temporary table after a time, of course, but I imagine this is not uncommon.

Another method would be having a last-modified timestamp on every row in the DB; then you could select records at or before the initial query time. Seems like overkill just for this one purpose, but I imagine it might be generally useful information to have in the DB.


I mean there's no great way, the list of tools you might deploy is very long though:

(1) Don't solve the problem, trust the user to figure it out. This is the default (update rows and if the data updates underneath someone then tough) and it works in 95%? of cases or so. Works poorly when computers consume your results.

(2) Versioning tables! Forget updates let's just insert all the things!

    SELECT * FROM projectVersions
      NATURAL JOIN (SELECT projectKey, MAX(created) AS created 
            FROM projectVersions 
            WHERE created <= :time GROUP BY projectKey) AS latest
(or join ON or USING..., use an incrementing version, etc. You need a UNIQUE index on (projectKey, created) which happens to also make this quite performant given the DB sizes involved...)

Deletion is somewhat finicky here; soft-deletion works just fine as-is, but hard deletion brings back the issues from (1). The bigger problem is that if a "project" is a big record then you start copying all this data and your DB grows huge.

(3) Snapshots + Diffs! The previous storage concern can be alleviated by only updating the whole record (a new "snapshot") after 10-20 diffs are recorded, otherwise just store the diff. It's O(1) if you ensure a constant max of diffs... If referential integrity via foreign key constraints is really important to you, the most extreme version of this is a table structure,

                          snapshotId
    [ projectSnapshots ] <----------- [ projectPrimaryContactDiffs ]
           \                                 |
            \                                | contactId
             \   primaryContactId            V 
              -------------------------> [ contacts ]
where each foreign key column has to become its own diff table to enforce the constraint! Then it's absolutely important for the snapshot table to have a metacolumn counting diffs, hah. But if you are not this picky then a single table projectSnapshotDiffs can work.

Other variations of this have extra diff columns sitting on `projectSnapshots` allowing you to change things, possibly just one column which is JSON, etc. ... there are tradeoffs on how available you need the versioned data to be to your database itself for DB queries.

(4) Graph databases. The most extreme form of (3) where we abandon snapshots: maybe every row in some table represents a diff for some column at some time! This ultimately creates a graph database, sentences or "facts" ("Horn clauses") are recorded "at time T, the [subject] [verb]ed such-and-so [object]," you store "record/fieldName/fieldValue" tuples. Sometimes you can also include an integer index to +1 assert or -1 retract facts--queries then sum over the facts, when you find values that are not 0 or 1 you can guess that a _merge conflict_ happened.

(5) Give up and do functional programming :) What this means is, if all of your records are immutable then you get time persistence for free, and you often don't need to copy data as in (2) because you can use structural sharing. The flip side is that everything needs to be accessed by _pointers_ rather than _indexes_, or else whatever is indexed needs to be copied, which can get expensive.

It's worth giving a story for why you would do this. You say up-front, "I want the ability to version a list of Resources for this Project, just like the other fields in my Project Snapshot. When you look at the Project at version 53, you should see the resources that were in that project."

Well, you start with Resources that point to a Project Snapshot or so, but whenever you generate a new Project Snapshot you find yourself copying a bunch of Resources, even when the resources haven't been updated. You maybe insert a ProjectResources many-to-many table, but you're still inserting a bunch into that many-to-many table even when the resources haven't been updated.

The FP instinct comes in when you say "I will have Projects have a reference to a ResourceList and store the reference to that ResourceList, so that I don't have to do this copying when I don't update the resources." Now the arrow has been (somewhat) inverted, Projects point at Resources (somewhat) rather than Resources pointing at Projects. Problem solved, sort of:

    [ projectSnapshot ] ----> [ resourceList ] <--- [ resourceListItem ]
                                                         /
                                [ resource ] <-----------
The problem still comes back to bite you later when it turns out one of your biggest power-user clients has 50,000 Resources which they are constantly updating, and now each time they update a Resource they trigger the creation of a new resourceList and hence the copy of 50,000 ResourceListItem records.

The story is that working backwards you say "ok I'm going to bound their update so each of their updates only inserts, like, 50 ResourceListItem records max," so now you have some notion of a ResourceListPages table, the resource belongs to a page of 50 or so resources in the ResourceList... except now you're still inserting 1,000 pages when you reconstruct every ResourceList. It got way better but it's still not scaling. In desperation you reach for a recursive structure and write out this comment in some source code, haha:

    /**
     * A ResourceList of order 0 is a page of up to 50 Resources which
     * foreign-key to it, for balancing reasons we store here a "count" of how
     * many Resource records point to it.
     *
     * A ResourceList of order N > 0 is a collection of pointers to up to 8 
     * ResourceLists of order N-1, called subList1, subList2, ... subList8, 
     * and the count is the sum of counts of all of the sublists.
     */
You'll have to be in a rather unique case before you run into this, haha, but when you do, you'll be glad if your database supports recursive queries with the WITH statement in SQL. And then SQL will be happy to fetch all of the ResourceLists of order 0, aggregate all of the Resources that point at them, and order them into reliable pages however you like.


You could ask for a count of records and divide by the page size.


Counts become incredibly slow on some databases like Postgres. It needs to do a sequential scan, which results in slow(er) performance on even tens of thousands of rows. When operating at scale, this is very, very slow.


From what I understand it's not slow if it can use an index, but the criteria for that are rather hard to understand as a normal user. If you vacuum often enough and the visibility map is current, it doesn't need a full table scan.


Isn't count slow on most databases? Either it ends up in table scan or index scan. But if the index is huge, then that will also take a lot of time.

I wonder how do Big Co solve this counting problem.


This seems unlikely to depend on which database software. It could easily depend on what indexes exist for the query.


This is what I do, since counts should cause minimal overhead (I've never had to deal with huge databases). But neither does the N+1 records trick which OP describes, that's clever. :)


Counts can cause significant overhead on even millions of rows.


If I were to ask the number of even numbers greater than zero and less than 10 billion, you could very quickly give an exact answer. If I were to ask the exact number of prime numbers on that same range, it would take much longer to find the correct answer.

In the same way, asking for counts may or may not be a quick answer, depending on both the table structure and the query being performed. Query the number of customers who started this year, on a table indexed by start date, and it's incredibly cheap. If that same table is only indexed by last name, then counts become more expensive. I'd say that the advice to return the counts has an implicit addendum to make sure that your database is appropriately indexed such that your typical queries are cheap to perform.


So I'm coming at this from a Postgres POV, IIRC MySQL can take some shortcuts for COUNT(*). Constraining the query will help but counting is expensive as a sequential table scan is always required.

Meanwhile I can only think of one use case for pagination with exact row counts – a user facing application. If you're preallocating storage you can get by with an estimate. If you're consuming an API and want something from near the end you can flip the sort order.


It's slow and expensive in MySQL too, assuming you mean InnoDB storage engine. Recent versions of MySQL can parallelize a COUNT(*), but that only helps so much.

The "shortcuts" you're thinking of may be from the MyISAM storage engine days. Since MyISAM doesn't have transactions or MVCC, it can maintain a row count for the table and just return that. But MyISAM is generally not used for the past decade (or more), as it just isn't really suitable for storing data that you care about.

Storage engines with transactions cannot just store a row count per table because the accurate count always depends on your transaction's isolated snapshot.


Not necessarily in postgres. If an index is available it may use that, though it will still have to refer to the visibility map to see if any row is available to the current transaction (cheap check), and if it can't determine from that, then and only then does it need to also refer to the heap / actual table (slower check).


Is that a recent change? Back in 2016 the Citus guys claimed that:

  There is no single universal row count that the database could cache, so it must
  scan through all rows counting how many are visible. Performance for an exact
  count grows linearly with table size.
https://www.citusdata.com/blog/2016/10/12/count-performance/


Apologies, I should've clarified. I just meant postgres doesn't necessarily have to scan the actual table to get the count in some cases if an index is available, not that it caches the actual count.


So that's not the greatest quote but the article shows postgres having to fall back on a sequential scan for a naive COUNT(*). The author goes on to mention that as of 9.2 Postgres can do an index scan (in the context of SELECT DISTINCT):

  But now we come to a quirk, SELECT COUNT(DISTINCT n) FROM items will not use
  the index even though SELECT DISTINCT n does. As many blog posts mention (“one
  weird trick to make postgres 50x faster!”) you can guide the planner by rewriting
  count distinct as the count of a subquery
To me it seems like you can trick the query planner into doing an index scan sometimes, but that it's a bit brittle. Has this improved much since?


I think the wiki explains it well

https://wiki.postgresql.org/wiki/Index-only_scans#Is_.22coun...

It says that index only scans can be used with predicates and less often without predicates, though in my experience I've seen the index often used even without predicates.


Relevant quote:

  Index-only scans are opportunistic, in that they take advantage of a pre-existing
  state of affairs where it happens to be possible to elide heap access. However,
  the server doesn't make any particular effort to facilitate index-only scans,
  and it is difficult to recommend a course of action to make index-only scans
  occur more frequently, except to define covering indexes in response to a
  measured need
And:

  Index-only scans are only used when the planner surmises that that will reduce the
  total amount of I/O required, according to its imperfect cost-based modelling. This all
  heavily depends on visibility of tuples, if an index would be used anyway (i.e. how
  selective a predicate is, etc), and if there is actually an index available that could
  be used by an index-only scan in principle.
So yeah, I don't know if your use case is exceptionally lucky or if the documentation is just pessimistic (or both). Good to know you can coerce the query planner into doing an index scan though.


Thanks, that's good to know (will have to perform some tests at some point). I do have a few tables (namely logs) that will have millions of rows eventually over time. But there I've also been planning on rotating them whenever that becomes a problem, so I'm not overly concerned. Or experiment with OP's solution.


Sure, but that point you've really got to start thinking about architecture.

Do you really need to paginate millions of rows and count them? Can you distribute the database and parallelize counting? Can you save counts somewhere else? Can you use estimates?


Query limit+1 and then check for the presence of the last item (the limit+1 item). Never use `count`: counts are usually slow, and you might need to add more filters in the future. I will add this to the article.


That’s what we do, I’m not sure there is another way other than making a request for the next page.


Just let people query for an empty page every now and then, no big deal.


So cursor pagination is just keyset pagination, except instead of using a column value directly you use an opaque string that gets translated back to a column value. So you can modify how the cursor gets translated back to a column value anytime without breaking API compatibility.

That said I'm not sure if everyone agrees on that definition, from what I know people do consider keyset pagination and cursor pagination to basically mean the same thing.


I think that's what the article is saying. I found the descriptions didn't really explain the differences very well.

We use a cursor that combines the sort key and normally the ID as a secondary sort and pointer value, otherwise you can’t paginate through data where the sort column has duplicates. We don’t really do much to obfuscate these, but we don’t document the structure and people who try to tweak the values won’t get far.


One advantage of opaque cursors is that they can include information that's not part of your public API. For instance, if you allow soft deletes and do not return deleted items in the list, you can end up with a massive stretch of deleted items separating two live results. You can leave a note in the cursor indicating what record was last scanned, and that's not possible in keyset pagination.
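
A rough sketch of that, with made-up table/column names and a bounded scan per request; the point is that the cursor can carry the raw scan position, which a plain keyset value couldn't:

    def fetch_live_page(conn, scanned_after, page_size=50, scan_limit=1000):
        # Scan a bounded window of rows (live and soft-deleted) after the last
        # scan position recorded in the opaque cursor.
        rows = conn.execute(
            "SELECT id, body, deleted_at FROM comments"
            " WHERE id > ? ORDER BY id LIMIT ?",
            (scanned_after, scan_limit),
        ).fetchall()
        live, last_scanned = [], scanned_after
        for row_id, body, deleted_at in rows:
            last_scanned = row_id
            if deleted_at is None:
                live.append((row_id, body))
                if len(live) == page_size:
                    break
        # last_scanned may sit on a soft-deleted row the client never saw.
        more = len(live) == page_size or len(rows) == scan_limit
        return live, (last_scanned if more else None)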


> So cursor pagination is just keyset pagination, except instead of using a column value directly you use an opaque string that gets translated back to a column value.

HashIds is a popular solution if those columns are numerical or can be represented numerically (e.g. timestamp).

https://hashids.org/



Hey thanks for the feedback. I will update the article.


I wish this article had gone into detail about how cursor-based pagination is actually implemented, like it did for the other options by showing SQL. Is this something the database engine has to provide support for?


Check https://use-the-index-luke.com/no-offset as a better reference.

But in most SQL databases, cursors are something you implement and parse at the application layer, and translate to a WHERE clause on the (hopefully indexed) column you're ordering on. That turns the O(N) "OFFSET" into a O(log(n)) index seek.
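
E.g., roughly (column names made up):

    # Offset: the engine still walks past (and throws away) the first
    # 100000 rows before it can return anything.
    offset_sql = "SELECT id, title FROM articles ORDER BY id LIMIT 20 OFFSET 100000"

    # Keyset/cursor: with an index on id, this is one index seek plus 20 rows,
    # no matter how deep into the result set you are.
    keyset_sql = "SELECT id, title FROM articles WHERE id > ? ORDER BY id LIMIT 20"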


Many databases have actual cursor implementations which are transactional. That means you’ll get a consistent view at the point in time you created the cursor.

That said, they tend to live only as long as the db connection (with some exceptions), so yeah you need some application work to make it sane.


How is a cursor different from key set in that case?


It's not, in that case; but cursors can be implemented however you want. It's an abstraction that prevents you from having to change the schema of the API when you change the implementation.


You can do something like the following in application code.

Imagine a table with the following schema/data:

    id=1, created_at='2022-05-28T18:00Z'
    id=2, created_at='2022-05-28T18:00Z'
    id=3, created_at='2022-05-28T18:00Z'

To retrieve articles in order of descending creation time, our sort order would be: created_at DESC, id DESC

The last item in the ORDER BY query should be a unique key to ensure we have consistent ordering across multiple queries. We must also ensure all supported sort columns have an INDEX for performance reasons.

Assume our application returns one article at a time and we have already retrieved the first article in the result set. Our cursor must include that last record's ORDER BY values, e.g. serialize('2022-05-28T18:00Z,3'); for example purposes I will use base64, so our cursor is MjAyMi0wNS0yOFQxODowMFosMw==.

When the user requests the next set of results, they will supply the cursor and it will be used to construct a WHERE (AND) expression: created_at < '2022-05-28T18:00Z' OR (created_at = '2022-05-28T18:00Z' AND id < 3)

So our query for the next set of results would be: SELECT * FROM articles WHERE (created_at < '2022-05-28T18:00Z' OR (created_at = '2022-05-28T18:00Z' AND id < 3)) ORDER BY created_at DESC, id DESC LIMIT 1

For queries with more than 2 sort columns the WHERE expression will slightly increase in complexity but it need only be implemented once. If the sort order direction is flipped, make sure to adjust the comparator appropriately, e.g. for 'DESC' use '<' and for 'ASC' use '>'.
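
Something like this covers the "implemented once" part for an arbitrary number of sort columns; a minimal sketch assuming everything sorts DESC and sqlite-style placeholders:

    def keyset_where(sort_columns, cursor_values):
        # sort_columns: e.g. ["created_at", "id"]; cursor_values: the ORDER BY
        # values of the last row already returned.
        clauses, params = [], []
        for i, col in enumerate(sort_columns):
            # equality on the higher-priority columns, strict inequality on this one
            eq = " AND ".join(f"{c} = ?" for c in sort_columns[:i])
            part = f"{col} < ?" if not eq else f"{eq} AND {col} < ?"
            clauses.append(f"({part})")
            params.extend(cursor_values[: i + 1])
        return " OR ".join(clauses), params

    where, params = keyset_where(["created_at", "id"], ["2022-05-28T18:00Z", 3])
    # -> "(created_at < ?) OR (created_at = ? AND id < ?)" with matching params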


Isn't that basically just a trivial extension of what the article calls "keyset based pagination"?

I'm not complaining since this is also what I usually do, I'm just wondering if there's more to it.


Oftentimes under the hood, cursor based paging IS keyset paging. The difference is that the keyset is obfuscated, which can allow the implementation to change without changing the API call.


Makes total sense. It also seems trivially automatable on the surface. Echoing GP: Is this an existing DBMS feature? If not, why not?


How do you handle removal?


I haven't had a need to implement this but something like the following should work to provide the same* records regardless of insertion/deletion.

* data of the record may have been updated, but if there are 1,000 records total, paginating in either direction will always return those same 1,000 records regardless of insertion/deletion.

- Filter by created_at <= time the endpoint first returned a result.

- Utilize soft deletes, where deleted_at IS NULL OR deleted_at > the time the endpoint first returned a result (records deleted after the snapshot still existed when pagination started, so they stay in the set).

Any keys allowed to be used in the ORDER BY query should also be immutable.
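
A minimal sketch of both filters together, with made-up table/column names; snapshot_ts is minted on the first request and carried along in the cursor:

    def stable_page_query(snapshot_ts, after_id, page_size=25):
        # Rows created after the snapshot, or soft-deleted after it, don't
        # change what this pagination run returns.
        sql = (
            "SELECT id, title FROM articles"
            " WHERE created_at <= ?"
            "   AND (deleted_at IS NULL OR deleted_at > ?)"
            "   AND id > ?"
            " ORDER BY id LIMIT ?"
        )
        return sql, (snapshot_ts, snapshot_ts, after_id, page_size)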


Some database engines will support cursors like this, but from what I can gather, cursor-based navigation is usually a wrapper around keyset or page navigation with base64-encoded parameters.

This makes the implementation details of pagination invisible to the API layer the user consumes.


No need for anything special, you would just query with something like "SELECT whatever FROM Table WHERE cursor>value ORDER BY cursor LIMIT n". Obviously there needs to be an index over the columns of the cursor, but any reasonable engine should allow you to add that effortlessly.


The complication comes from having a cursor over results that are sorted on multiple columns. It's not incredibly difficult to implement, but it can certainly be annoying to do, especially if you want to support before+after cursor as well.


The API improvement proposals (AIP, yes it's a terrible acronym) are a rich source of API design wisdom. They assume protobufs, but the majority of proposals would work for any API.

The relevant one is https://google.aip.dev/158, which recommends cursor-based pagination and covers different pagination use-cases, including:

- support for different sort orders - mentioned in article as weakness of keyset pagination

- support for skipping results - mentioned as a weakness of cursor-based pagination

- why to use opaque page tokens


This is a good reference, but IMO page tokens need to expire, even if you aren't storing any data for them. Expiring tokens make it possible to change pagination formats over time (you only need to wait the expiry time while supporting both formats), eg to improve efficiency.

Also I'll add that if you support a pagination API that accepts args (e.g. to filter on different dimensions) you should include a hash of the args in the opaque next page token, and fail the call if the args change. This prevents some nonsense scenarios where you started paging by scanning index A but the user switched filters such that you'd instead scan index B, but you don't have the key for that column.
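
A minimal sketch of binding the token to the request arguments (the hashing and token layout are my own choices, not from the AIP):

    import base64, hashlib, json

    def _args_hash(request_args):
        return hashlib.sha256(
            json.dumps(request_args, sort_keys=True).encode()
        ).hexdigest()[:16]

    def make_page_token(keyset_values, request_args):
        payload = {"k": keyset_values, "a": _args_hash(request_args)}
        return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

    def read_page_token(token, request_args):
        payload = json.loads(base64.urlsafe_b64decode(token.encode()))
        if payload["a"] != _args_hash(request_args):
            raise ValueError("filters/sort changed mid-pagination")  # -> INVALID_ARGUMENT
        return payload["k"]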


Your second point is noted in the https://google.aip.dev/158 guidance.

> Request messages for collections should define a string page_token field, allowing users to advance to the next page in the collection.

> * If the user changes the page_size in a request for subsequent pages, the service must honor the new page size.

> * The user is expected to keep all other arguments to the RPC the same; if any arguments are different, the API should send an INVALID_ARGUMENT error.


Expiring tokens has a section in that link:

> Many APIs store page tokens in a database internally. In this situation, APIs may expire page tokens a reasonable time after they have been sent, in order not to needlessly store large amounts of data that is unlikely to be used. It is not necessary to document this behavior.


Hey! Nice! I do think clients shouldn't store the cursor. This technique is neat, I will add it as an option to the Article (I’m the author).


For large datasets, consider avoiding pagination complexity by having your API return an (async) job ID and dumping the whole result set to a file which can be fetched (over HTTPS) via the job ID.

I have done this with both Postgres/S3 and Redshift/S3 backends and presigned URLs and looks like I could do it with Snowflake too.
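
Roughly this shape on the S3 side (bucket/key naming and the ndjson format are assumptions; the queueing/worker piece that writes the file is omitted):

    import boto3

    def export_download_url(job_id: str, expires_in: int = 3600) -> str:
        # Once a background worker has dumped the full result set to S3 under
        # this job id, the API hands back a short-lived presigned URL instead
        # of paginating.
        s3 = boto3.client("s3")
        return s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": "api-exports", "Key": f"{job_id}.ndjson"},
            ExpiresIn=expires_in,
        )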


Hey, I'm the author of the post. Thanks for the feedback! I will update the article with some of the insights from here.

I wrote the post when I was investigating cursor-based pagination, which I had to implement at Shopify. We decided to use Cursor-based pagination since offset performance wasn't good enough. The larger the offset, the slower the query.

We decided to add the cursors to the LINK header of each response. The header contains two URLs: the previous and next page.

We saw a tremendous impact! In some cases, iterating over the pages sequentially using Cursor was much faster than sending parallel requests using Page-based pagination.

I built a library where you can compare the performance of Cursor vs Offset for a Shopify Store iterating on specific resources. This helped validate the approach at Shopify and prove to partners that it was worth migrating to Cursors!

More information here https://twitter.com/IgnacioChiazzo/status/153071174122657382...


The limitation of cursor-based pagination - where you only know the previous, current, and next page's cursor identifier - sounds ideal for infinite scrolling applications, where the user does not get the option to pick a page.

Another solution or optimization I've seen is in forum software, where if a user executes a search, the IDs of the search results are copied into a "search results" table with the equivalent of a session ID; that way, the data remains stable between paging, and the data set over which pagination has to be done (e.g. via offset/limit) is very small; joins can be used to link to the items in question in the search results (e.g. posts, threads), or all the items that need to be shown in the paginated search results can be copied over to the ephemeral search results table.

And of course, since it's ephemeral and memory is cheap these days, the whole search result can be cached in server-side memory.
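
A rough sketch of that forum trick, with made-up table/column names:

    def snapshot_search(conn, search_id, matching_ids):
        # Freeze the hit list for this search/session in an ephemeral table;
        # paging then runs against this small, stable set.
        conn.executemany(
            "INSERT INTO search_results (search_id, position, post_id) VALUES (?, ?, ?)",
            [(search_id, pos, pid) for pos, pid in enumerate(matching_ids)],
        )

    def search_page(conn, search_id, offset, limit):
        return conn.execute(
            "SELECT p.* FROM search_results r JOIN posts p ON p.id = r.post_id"
            " WHERE r.search_id = ? ORDER BY r.position LIMIT ? OFFSET ?",
            (search_id, limit, offset),
        ).fetchall()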


Are these actually server-side cursors or keyset pagination? For keyset pagination you need essentially the last value of the columns you used to sort by, but this seems to have some random or encoded cursor value. I didn't think server-side cursors would be used at that scale so this has to be keyset pagination, so I think I'm missing something.

And while I really like keyset pagination, it is annoying to implement in some cases. If your sorting column is not unique you need to sort by multiple columns. And that is really annoying and less efficient if you don't have row value comparisons. And not all databases and especially not all ORMs support those. I'm still waiting on EF Core support here, it looks a bit like this might happen but it's not certain yet.


By server-side cursors do you mean like open connections to MySQL reading a stream of query results, or something like that? I don't think that's feasible for most websites. The biggest issue is that you'd have to guarantee that when someone clicks "next page", their browser ends up talking to the exact same machine that it got the previous page from. But websites running on big fleets of machines don't usually work like that; instead your queries are usually somewhat randomly distributed across a whole tier. You might work around that by trying to keep a persistent connection open from the browser to the server, but you'd lose that connection when you refreshed the page or when the server restarted, or if you lost WiFi for a minute. Making hiccups like that visible to the user is pretty frustrating. It's a lot more robust to use some sort of cursor design that doesn't rely on any persistent connections.

Some other problems that crop up:

- A lot of websites expect to serve most results from cache, and to only hit the DB a minority of the time, but to get your hands on a real DB cursor you'd have to hit the DB every time.

- Not all DB's support moving cursors backwards. Maybe most of them don't? I'm not sure.

- Open DB connections are usually a somewhat limited resource.


Surely a keyset, even if it's multi-key, is more performant than an offset query. Offset queries usually require the previous rows to be computed first, so large offsets are super slow.


It is, but it's not a cursor in the traditional sense.


EF core now supports it with PostgreSQL

https://github.com/npgsql/efcore.pg/pull/2350


>but this seems to have some random or encoded cursor value

Maybe the actual key value is stored server side for obfuscation/optimization reasons?


“Cursor-based pagination is the most efficient method of paging and should always be used when possible.” Facebook API Docs

Why is it more efficient than key set pagination? According to this article https://archive.ph/2021.04.30-125536/https://medium.com/swlh... the only difference is that the value of the key is encoded to allow the implementation to change


Both approaches use the underlying index in the database, but cursor approaches need to look at far fewer results in order to return the page. If you are deep into the result, your offset could be high, like 1000 or so. The database would have to start at the beginning and go through 1000 results, especially if filters are in play, in order to get to the N items in the page.


OP talks about key set pagination which also operates by a WHERE clause, not about simple offset-based.

According to other commenters cursor based is the same as key-set, but without exposing the internal format in the public API, so it can be implemented as easily as taking a reversible “hash” of a last seen index and the sort order.


You are correct. I misread the article. We use different terms for this stuff internally, having both simple offset-based and continuation-token-based pagination, the latter of which is essentially a mix between what the author describes as key-set and cursor based, using a different FE API.

IMO, the performance would come down to the underlying database query: you could get faster key-set in some cases, and faster cursor based in others if the query was hitting a good index and using fewer joins. Both approaches, "key set" and "cursor based" let you search based off index, so I don't see how one could be inherently faster?


But how does this make it the most efficient method of pagination vs keyset? To me it seems like an overstatement.


It is not more efficient than key sets as it is basically the same.


Although https://news.ycombinator.com/item?id=31541822

lazide's comment suggests some DBs let you expose the cursor, which I haven't heard of before. But I agree with you unless what lazide said exists.


Cursor based pagination makes sense. At a certain scale every part of your API becomes something you need to support, and that includes pagination. This makes it super difficult to change the storage you use, as suddenly you need to support integer based offsets rather than opaque tokens.


Cursors also mean you get more predictable behavior when interacting with data that's changing under you. Imagine scrolling back through an active chat, for example -- you wouldn't expect each page to contain some of the same chat messages that are already on screen.


Pagination in APIs is a PITA, as you have to make many requests to get all the data. 99% of the time you want all the data, not just the first page. For the 1% use case you could use http-range or page. Pagination is often implemented as a premature optimization.


> 99% of the time you want all data, not just the first page.

This certainly doesn’t seem to be the case. Take viewing this very website, for example… are you suggesting that most people want to see all posts of all time, rather than just the first page or two? Paging is used extensively for feeds, and feeds are crazy common these days, and people rarely want to see all posts ever on a feed.


The difference is that this website is accessed by humans, while APIs are accessed by programs.


APIs are accessed by both humans and programs. Also, programs don’t want everything every time, either.


In my experience the only reason you would want all the data is if the API lacks good filter queries. IMO, making it a PITA to download all data is kind of a feature.


> Pagination is often implemented as an premature optimization.

or for forwards-compatibility.


8 minutes of wire time is bad and pagination is usually part of server side frameworks...


Related: Do you think SDKs for these APIs should themselves require you to interact with the cursors?

If I ask an SDK to get me 100 items but internally it returns only 50 before needing to make another call with a marker or cursor, it would be nice if the SDK had an option to handle this for me (possibly still exposing cursors for those who really want them).


SDKs should typically return a lazy iterator in languages which support it, so callers do not even care about pagination and just iterate on it until they don't want any more data, while the SDK fetches new pages when needed.

In pseudo-Python:

    import requests

    def get_stuff():
        url = "https://example.org/api.php"
        while url:
            response = requests.get(url)
            response.raise_for_status()
            yield from response.json()
            # requests parses the RFC 5988 Link header into response.links
            url = response.links.get("next", {}).get("url")

and you call it with:

    for item in get_stuff():
        print(item)


The AWS Python API uses “paginators” to allow you to iterate over results with a single for loop; it handles the pagination in the background. The underlying API - which is also exposed - uses an opaque “NextToken”.
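
For example, with S3 (bucket name is a placeholder):

    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    # The paginator chases the service's continuation token behind the scenes.
    for page in paginator.paginate(Bucket="my-bucket"):
        for obj in page.get("Contents", []):
            print(obj["Key"])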


What if it fails after 50?


Maybe return the items it found so far and provide the cursor it failed on? Have a way to optionally provide the cursor? I can think of some ways I'd implement this in an SDK. Probably for simplicity I'd have some sort of JSON status field indicating the request (traversing all cursors) completed. If it's not complete I still would look for the next cursor it returned and do the usual thing manually.


Can’t we do both offset and keyset-based pagination? If a user wants a specific page, let them specify it, return them links to the next and previous pages, using filter + limit.


I don't like the term “cursor-based” pagination. Database cursors are different from this, have been around for decades, and can actually be used to implement pagination (albeit requiring the transaction to be kept alive between pages, which can be… problematic).

Perhaps, “token-based” pagination would be a better term?


I agree. I was unsure if the author meant db cursors which would be keeping transactions alive. That has obvious scaling issues. But reading the comments here has me convinced they are talking about application generated cursors. I think this is a pretty reasonable confusion and it would benefit from added clarity, like naming them differently.


> Perhaps, “token-based” pagination would be a better term?

Let's do it. When I was first learning about it, I wondered how the concept related to database cursors, if at all.

It's really just a pointer to the next item at its most simple. But it really is just any opaque blob. Maybe you want it to be a jwt/encrypted blob with a request counter so you can track request usage by specific accounts (along with the pointer). jk that's probably a terrible idea (too entangled). Idk, just came up with it. Point is, you could do something like that, it's just a blob opaque to the client.

So I like "token-based pagination".
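
A minimal sketch of such an opaque token, assuming a simple integer keyset column (a real implementation might also sign or encrypt it):

    import base64
    import json

    def encode_cursor(last_id: int) -> str:
        # Opaque to the client: just a base64-encoded pointer to the last row seen
        payload = json.dumps({"last_id": last_id}).encode()
        return base64.urlsafe_b64encode(payload).decode()

    def decode_cursor(token: str) -> int:
        return json.loads(base64.urlsafe_b64decode(token))["last_id"]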


It's highly aggravating and confusing when hipsters try to co-opt terms that have had a defined meaning for decades. The ones that grate on me are "retina display", "real time", and "x-nanometer node".


Thank you, now I don't need to write this comment


In 32 years of software development, it fascinates me how ever much more time I spend just moving/translating data between domains. I used to model things more, and do human interaction more. Now it seems like a dominant task I do in any of the products I work on is just shifting data from one place/format/representation/node to the other.


Let's rewrite the microservices wiki a bit:

"A microdata architecture – a variant of the data-oriented architecture structural style – arranges data as a collection of loosely-coupled data. In a microdata architecture, data are fine-grained and the relations are lightweight."

Likely most experienced data architects are now laughing at what a dumb idea the above is. That's the same reaction I had when I first heard of microservices - "that's obviously a bad idea; ppl aren't that dumb", I thought.


But big monoliths/spaghetti balls of code aren’t great either, you’d agree with that?

It seems like we don’t have the ability to do “all things in moderation” as an industry very well. We’re in constant search of The Silver Bullet, that if we just apply it, will… buy the world a coke and live in perfect harmonies or something. And we go through the cycle again and again and again.


Ah, yes...the GRIMM stack: General Repetition of Integrations, Migrations, and Misery


“Modern day best practices”…


I have seen a variation of the cursor-based approach in the Intercom API [1], but with a fixed cursor (while the example shows a different cursor on every request), which also does not allow you to create another one within 1 minute. It feels almost like a database server cursor; does anyone want to guess how this is implemented?

[1] https://developers.intercom.com/intercom-api-reference/refer...


> There is no way to skip pages. If the user wants page X, it needs to request pages from 1 to X.

Huh? The client can ask the server to generate a new cursor, any desired amount of items ahead in the list.


With opaque cursors, how does a client specify "a desired amount"?


You just say "this cursor + 5 pages", for example.


You send `GET /items[?cursor=<cursor>]` to my server.

I respond:

    {
        "items": [ {...}, {...}, {...}, ... ],
        "next_cursor": "77963b7a"
    }
How do you request "this cursor + 5 pages"?


GET /items?cursor=77963b7a&add_to_cursor=5pages


I strongly suggest not doing this, since the application will be implementing `offset` in that scenario. This is the same as having page-based pagination but with a cursor that acts as a filter/sorting blob.


If the server supports this, which they won't, because they don't want to support pagination.


Moving forward and back in a cursor (pages) is easy... but how about random pages?

i.e. this forum, HN. If there were ten pages of comments, going from page 1 to page 2, or vice versa is trivial with a "after" or "before" query. But what about jumping directly to page 3 without previously visiting pages 2 or 4?

So far I've implemented a hybrid... pages still exist, but next and previous page (the most common actions) are cursor-based.

Is there a better way?


If you use ULIDs for IDs (sortable by time, millisecond resolution) and a tombstone field (nullable deleted_at is a good one) then you have a very stable collection that guarantees no new objects will be inserted/deleted before the write head - it is append only in effect.

You can then do some cool things, especially if your objects are evenly distributed and especially if you don't need exact page sizes or can over-query and hide them as needed.

If you know the approximate frequency of objects, you can then map some linear scaling to an approximate range of ULIDs. Basically the German tank problem in reverse.

https://en.m.wikipedia.org/wiki/German_tank_problem
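
A rough sketch of that approximation (hypothetical, assuming a known average insert rate; real code would clamp the estimate and verify it against actual rows):

    # Build a ULID-like lower bound for "roughly page N".
    def approx_page_cursor(first_ts_ms: int, items_per_ms: float,
                           page: int, page_size: int) -> bytes:
        # Estimate the timestamp by which roughly page * page_size items existed
        target_ts_ms = first_ts_ms + int(page * page_size / items_per_ms)
        # ULID layout: 48-bit big-endian timestamp + 80 random bits;
        # zeroing the random part gives the smallest ULID at that timestamp
        return target_ts_ms.to_bytes(6, "big") + b"\x00" * 10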


IMO if a solution requires users to address data by page, it's a sign that the search functionality is not good enough.

In general a user doesn't want to find "page 3", they are looking for a record that has certain characteristics. They shouldn't need to think about pages, just search terms.


Another downside of offset-based pagination is less consistent handling of data changes during pagination.

If any row is inserted anywhere before the page you requested, you'll see some other item twice (once at the end of the previous page, then again at the top of the next page). Similarly if any item was removed, you'll skip over some other item when paginating due to every page shifting by one.
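
A tiny worked example of the duplicate case, with a newest-first list and a page size of 2:

    items = ["e", "d", "c", "b", "a"]   # newest first
    page1 = items[0:2]                  # ["e", "d"]
    items.insert(0, "f")                # a new row arrives before page 2 is fetched
    page2 = items[2:4]                  # ["d", "c"] -> "d" shows up twice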


I’d take it further: offset based pagination failure modes look different based on the implementation. If you have API clients, you’re exposing that implementation detail to them.


That was mentioned as a con for offset based pagination.


I think the real question here boils down to semantics:

1. Do you really need to have page numbers and if so, when?

2. Do I really need to skip pages or go to an arbitrary one and if so, when?

3. Can't I just show you the first N rows forever, and demand that you limit results by tweaking your filter? What is the N at which I would allow this?

4. When do I show the count of items, and when not?

5. Do I need an estimate or a precise count?

6. Shouldn't I just infinite scroll up to some N? How does that limit the user, can he skip some pages that way, and what happens when he goes back and forth between the item details and the results list?

7. How big are those lists? Do they grow forever and if so, is it frequent or not? How many new items do you expect per year? Do users always need to use a filter, or do they need a "gimme everything" case? Can we delegate some of the pagination to the client?

Depending on the answers to those questions (and more), we can determine the technology behind it. We know that COUNT(*) is expensive and that OFFSET sucks, but we also know that it might not matter at all in some cases (premature optimization) and might be a deal breaker in others.


If you're deciding to paginate the output from an endpoint, please, I implore you, think about how much data you're returning to the client per page. I have worked with so. many. APIs. where the max page size is "100 items", and each item is … 300 B, for a total page size of ~30 KiB.

And then if you want to pull down the entire result set, you end up drinking it through a coffee stirrer, and it takes an absurd number of usually serialized calls. Latency kills. 100 calls later and you've downloaded a whopping "big data" level of 3 MiB. That would have fit just fine in 1, maybe 3, API calls.

By far the worst offender I've run across is Azure's Container Registry, where a combination of ridiculously high latency and small page size results in something like 3 MiB of container metadata taking like 5 minutes to fetch. I remember bug reporting it (won't Fix! gaaaahh) and IIRC, we computed the overall throughput to be ~56 kbps.


Nice write up!

I've really struggled with Haskell's lack of support for pagination, and have worked on at least two "inner-source" libraries that offer what the author is calling "Key Set" or limit/offset pagination, and Cursor Based pagination, as Servant combinators.

What I've found is that "Key Set" is much more ergonomic for library authors to implement if they have to write the database/SQL calls themselves, although "Cursor Based" is technically more efficient.

It's my opinion that the standard, out-of-the-box behavior should be to paginate every single endpoint that returns a list of results, but in order to do that easily you'll need some framework support. More mature frameworks have this right now, and IMO this is one of the weak points of using a less supported language like Haskell. You'll either have to roll your own solution, or rely on spotty library support that does that pagination operation on the CPU, in application code rather than in the database.


Kinda related to this discussion about pagination APIs:

One year ago I had to design an API [0] for paginated nodes which also allows nested nodes. Overall there would be more than 100,000 nodes in the database.

On the UI side there would be a tree table to display the nodes. First you would fetch page 0 with 100 nodes, and from there you could either perform a paginated fetch for the next page (page 1) or a nested fetch of page 0 for one of the initially fetched nodes. So essentially the API allows paginated and nested fetches, where each nested fetch is a paginated fetch too.

We used cursor based pagination, because there would be added and deleted nodes eventually in a collaborative environment. It was definitely a fun challenge!

[0] https://www.robinwieruch.de/react-tree-list/


"Continuation token" is a better term than "cursor", as it avoids confusion with DB cursors.

https://phauer.com/2018/web-api-pagination-timestamp-id-cont...


The Google Calendar API leaves a lot to be desired (documentation, etc), but over time what I've found is that it's really well-designed once you understand it.

In their case, they went with a "pageToken" variable which is basically a cursor.

The API has been around for years.


It's a product thing. For visual pagination, people are used to skipping pages, so you have to use offsets. You can limit the number of pages you can request and the page size. For machine APIs, you can design the most efficient solution, which is probably some streaming solution, but these are difficult for clients. So usually you compromise. Most of the time I use pages because I don't want developers to churn because keys or cursors are too hard for them to use. You can also use a schema-based tenant system where applicable, which makes the problem easier with smaller indexes and tables to sort.


Sometimes the data is so big and performance so critical that you have no option but to avoid offsets.


I tried an experiment a while back to handle this using custom units for the "Range" HTTP Header.

The biggest drawback is that the expectation is still that data would be valid concatenated, so (for example) multiple results would be expressed as multiple JSON values concatenated together, rather than entries in a single array.
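
For illustration, a hypothetical client call using a custom "items" range unit (not a standard unit; the server would have to opt in and answer with 206 and a matching Content-Range):

    import requests

    resp = requests.get("https://example.org/api.php",
                        headers={"Range": "items=0-99"})
    # A cooperating server would reply 206 Partial Content with a header like
    # "Content-Range: items 0-99/2500"
    print(resp.status_code, resp.headers.get("Content-Range"))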


If I take the keyset approach and then obfuscate my sorting key+value in some token, it becomes a "cursor"? Is the only point in doing this to stop abuse of large offsets?

Also, the article states that a con of keyset pagination is that you can't use offsets. Technically you can, just the same as paging (WHERE ID > X ORDER BY ID LIMIT Y OFFSET Z)??


In my ecosystem of choice, I wish Spring/Spring Data had support for anything other than page- or offset-based pagination. But I couldn't find any such support, last time I went looking.


A REST API can't use page-based pagination because it's not stateless. It is also not cacheable.


It's exactly as RESTful and cacheable and stateless as common websites, eg the HN frontpage or a newspaper homepage. I think your bar for when something is RESTful or not is too high to be useful.


Edit: my bad, I was looking at the wrong 'more' button.



Why? The server doesn't need to know if I already requested any previous page when I request page 9. And if the data doesn't change it can easily be cached.


"Rest in Practice" (if I'm remembering the right book) would disagree strenuously with you.

Once you're doing REST instead of just "somewhat sane JSON responses" then the response has self-descriptive elements, including links to related resources. And importantly, URLs for older resources become static very, very quickly. Not unlike the internals of subversion or git.

The trick there is to realize that O(1) = O(2). If you're trying to show 25 items per page, you do not need a query that returns the most recent 25 items. That creates a situation where every single interaction is distinct from every other interaction.

Asking for 25 results and everything newer than that is barely more complicated than pagination already is.


If the page, limit, and offset are parameters, it is.


If you query the first page with a limit of 10, then insert 10 items, do you get the same items with the same parameters?


Using that logic you can only have restful APIs if your application is useless and data never changes (which is fine for me, I hate debates about restfulness)


If you really want to cache the results, use ETags based on a hash set at write time, or just include that hash as part of your request as well.
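
A minimal sketch of that idea (hypothetical; the hash would be stored when the collection is written and compared against If-None-Match on reads):

    import hashlib
    import json

    def collection_etag(rows: list) -> str:
        # Recompute (or update the stored value) whenever the underlying data changes
        payload = json.dumps(rows, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    # On a GET, return 304 Not Modified if the client's If-None-Match
    # header equals collection_etag(current_rows).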


cursors result in expensive server side state. if ordering is less important an alternative is to make cardinality queries cheaper and then allow the client to request sections thereby making the scheme stateless for the purposes of the server.

so one query returns an approximate cardinality of the result set or a partition count based on approximate size of the result set and a fixed request response size.

the second to n queries are gets, but the gets include the partition count and the partition requested. the server uses a hashing scheme to either reduce the input data used for the construction of the response or filtering of the response itself.

use of numerical partition ids and hashing allows for the response to maintain consistency over time as well as pushing of the sharding/selection scheme as deep as necessary into the data layer to avoid scalability issues.

requests that are too big are cancelled and given abuse points.
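
A rough sketch of the idea (hypothetical; here the filter runs in application code, but the same hash could be pushed into the query or storage layer):

    import hashlib

    def in_partition(row_id: str, partition: int, partition_count: int) -> bool:
        digest = hashlib.sha256(row_id.encode()).digest()
        return int.from_bytes(digest[:8], "big") % partition_count == partition

    # The client asks for "partition 0 of 2"; the server returns only matching rows.
    rows = [{"id": "a1"}, {"id": "b2"}, {"id": "c3"}]
    page = [r for r in rows if in_partition(r["id"], partition=0, partition_count=2)]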


Cursor based pagination is a good idea for APIs because it allows for more flexibility in how data is displayed to the user. It also helps to prevent large amounts of data from being returned all at once, which can cause performance issues.


> It also helps to prevent large amounts of data from being returned all at once, which can cause performance issues.

Isn't that the point of all pagination, regardless of how it's implemented?


> It also helps to prevent large amounts of data from being returned all at once, which can cause performance issues

You can also limit page size and accomplish the same thing.


All pagination can help prevent large amounts of data being returned all at once.


> Bad performance for large OFFSET in SQL. When doing OFFSET N in SQL, the database needs to scan and count N rows.

For user-facing apps, you’ve failed pretty hard already if users can’t find what they need on the first page or so.

If your offsets are big enough for this to matter, I’d rather spend time on why users are so many pages in to the data than optimizing the performance of later pages.

(Processing clients, on the other hand, should query for what they need. Processing in batches may make sense, so there cursor-based, or what they call keyset-based, “pagination” makes good sense. Though in the case of processing clients, I wouldn’t call it “pagination”… it’s more like batches. I’ve mainly used “keyset-based” pagination for processing events, which can alleviate some of the “cons”.)


Users do fuzzy searches across data all the time. They may remember approximately when they last saw a record but not its exact name, or their recollection may be close but imperfect.



