
You should never design an API that uses offsets for pagination unless you're dealing with small amounts of data (<10000 rows). Cursors give you far more flexibility and if you want to be lazy, you can just hide an offset in an opaque cursor blob and upgrade later on.

I don't think I've used offsets in APIs for at least 10 years. Lightly-obfuscated cursor tokens are one of the first things I build in web projects and that's usually less than an hour's work.

If you _really_ need the ability to drop the needle in your dataset with pagination, design your system to use pseudo-pagination where you approximate page-to-record mappings and generate cursors to continue forward or backward from that point.
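
To make the "opaque cursor blob" trick concrete, here's a rough Python sketch (helper names are made up, and in production you'd also sign or HMAC the token so clients can't forge cursors):

  import base64, json

  # Opaque cursor that secretly wraps an offset. Clients just echo the token
  # back, so the server can later swap the payload for a real keyset cursor
  # without breaking anyone.
  def encode_cursor(payload: dict) -> str:
      return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

  def decode_cursor(token: str) -> dict:
      return json.loads(base64.urlsafe_b64decode(token.encode()))

  # v1: the cursor is just {"offset": 40}.
  # v2 (the later upgrade): {"after_id": 1234} -- same opaque token shape.
  next_cursor = encode_cursor({"offset": 40})
  print(decode_cursor(next_cursor))  # {'offset': 40}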




IIRC the DigitalOcean API uses offsets. We had to write special code to handle the possibility of the list changing while it was being queried.


When I worked for a news site, we moved all listing endpoints from offset-based to timestamp-based pagination, because offset-based pagination assumes the data being queried will remain stable between requests. This was, of course, very rarely true during business hours.

Duplicates or missed entries in the result set are the most likely outcome of offset-based pagination.
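
For reference, the keyset version looks roughly like this (toy Python/SQLite sketch; schema and names invented):

  import sqlite3

  # Keyset (timestamp-based) pagination: filter on the last-seen
  # (published_at, id) pair instead of using OFFSET, so rows added or
  # removed above the cursor can't shift the window.
  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, published_at TEXT)")
  conn.executemany("INSERT INTO articles VALUES (?, ?)",
                   [(i, f"2024-01-{i:02d}") for i in range(1, 31)])

  def page(conn, last_published=None, last_id=None, size=10):
      if last_published is None:
          return conn.execute(
              "SELECT id, published_at FROM articles "
              "ORDER BY published_at DESC, id DESC LIMIT ?", (size,)).fetchall()
      # Row-value comparison keeps the order stable when timestamps collide.
      return conn.execute(
          "SELECT id, published_at FROM articles "
          "WHERE (published_at, id) < (?, ?) "
          "ORDER BY published_at DESC, id DESC LIMIT ?",
          (last_published, last_id, size)).fetchall()

  first = page(conn)
  last_id, last_ts = first[-1]
  second = page(conn, last_published=last_ts, last_id=last_id)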


Do any of these avoid that scenario? Unless you are paginating on `updated_at ASC`, there is an edge case of missing new data on previously requested pages.


Missing data that began existing after you started querying is usually OK: if you had requested the data one second earlier, it wouldn't have been returned anyway. With offset pagination, you can miss data that has always existed, which can make you believe an item that previously existed had been deleted.

Be very careful with timestamp or auto-increment id pagination too. Rows don't necessarily become visible in id or timestamp order, because the id or timestamp is generated before the transaction commits. Unless your database has some specific way of ensuring otherwise, a row with a smaller id can become visible after a row with a larger id, and a cursor that has already moved past that point will silently skip it.


What do you use then that has the same order as rows becoming visible?

We use an auto-increment id, and lock inserts on the related account (which always limits the scope of the query).

The only other (stateless) way I can think of is to somehow fiddle with transaction numbers linked to commit order.
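
Roughly what the lock-on-insert approach looks like (Postgres flavor via psycopg 3; table and column names invented, and it obviously needs a running database):

  import psycopg

  # Serializing inserts per account means rows become visible in id order
  # *within* that account, so an id-based cursor scoped to the account can
  # never skip past a not-yet-visible row.
  with psycopg.connect("dbname=app") as conn, conn.transaction():
      cur = conn.cursor()
      # Row lock on the account: concurrent inserters for the same account
      # queue up behind us until this transaction commits.
      cur.execute("SELECT 1 FROM accounts WHERE id = %s FOR UPDATE", (42,))
      cur.execute("INSERT INTO events (account_id, payload) VALUES (%s, %s)",
                  (42, "..."))
  # Commit releases the lock; ids for this account appear in order.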


Sorry for replying very late. I've used a similar technique of locking a "parent" row while adding items to a corresponding set. It works great as long as you can figure out a relationship like that and it's fine-grained enough that you are OK with the locking.

In traditional databases, the number linked to the commit order is usually the LSN (Log Sequence Number), which is an offset into the transaction log. Unfortunately, you can't figure that out until your transaction commits, so you can't use it during the transaction.

A hypothetical database where you could see your own LSN from within a transaction would require transaction commit order to be pre-determined at that point. An unrelated transaction with a lower LSN would block your transaction from committing.

In non-traditional databases, this could work differently. E.g. in Kafka you can see your partition offsets during a transaction, and messages in that partition will become visible in offset order. The tradeoff is that this order doesn't correspond to global transaction commit order, and readers will block waiting for uncommitted transactions (plus all the other things about Kafka too).


Exactly.


Something always feels off about repopulating lists that people are working on. Like that concept needs an entirely different display method.


It’s what happens with stateless protocols and offset-based pagination 100% of the time, but most people don’t notice it.


How do you jump to arbitrary pages with cursor pagination?


You don't, but if your cursor is just a position in some well-defined order, you can sometimes craft a cursor out of a known position. Continuing the book metaphor, instead of skipping to page 100, which has no semantic meaning, you could skip to the beginning of the "M" section of the dictionary, or see the first 10 words after "merchant".
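
In code, "crafting" that cursor is just an ordinary keyset query from a known value in the sort order (toy Python/SQLite sketch):

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")
  conn.executemany("INSERT INTO words VALUES (?)",
                   [("map",), ("merchant",), ("mercy",), ("merge",), ("merit",)])

  # First 10 words after "merchant" -- no offset, no page number.
  rows = conn.execute(
      "SELECT word FROM words WHERE word > ? ORDER BY word LIMIT 10",
      ("merchant",)).fetchall()
  print([w for (w,) in rows])  # ['mercy', 'merge', 'merit']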


Forum posts are the classic pagination example (in my mind), but I feel like the type of cursor pagination you’re talking about might work better.

Skipping forward by date, by username, or even by concept might make long threads much quicker to scan through and understand.


You generally don’t, unless you want to linearly scan through n pages. If your API uses offsets or page numbers and the underlying collection is having items added or removed, the behavior is undefined or straight-up wrong. I think it’s okay to use offsets or page numbers where the collection is static, or where it’s acceptable to occasionally skip over an item or see a duplicate while paginating.


Or where you want people to be able to send each other links of the ‘look on page 11’ kind.


Going with the point re: cursor links, if your URLs look something like this, then they would be shareable and more stable than page=11:

  ?startWith=<item_id>&sortBy=<alpha|datetimedesc|whatever>
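
Server-side, resolving that URL is another keyset query (sketch; only the alpha sort is shown, names invented):

  import sqlite3
  from urllib.parse import parse_qs

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, name TEXT)")
  conn.executemany("INSERT INTO items VALUES (?, ?)",
                   [("a1", "apple"), ("b2", "banana"), ("c3", "cherry")])

  # Look up the anchor item's sort key, then page forward from it. The link
  # stays meaningful even as items are added or removed around it.
  params = parse_qs("startWith=b2&sortBy=alpha")
  page = conn.execute(
      "SELECT id, name FROM items "
      "WHERE name >= (SELECT name FROM items WHERE id = ?) "
      "ORDER BY name LIMIT 20",
      (params["startWith"][0],)).fetchall()
  print(page)  # [('b2', 'banana'), ('c3', 'cherry')]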


Just impossible to navigate to by voice. In the office here, people often call out the page number, but nobody will call out cursor AHgeuusn5d.


Depending on what you mean by ‘page’, you probably want a query? (Show me all results with values between 1 and 10)

If you want to jump ahead in the results, that is fundamentally unstable though.


I see three options if this is a necessary use case against a shared, changeable dataset:

1. Accept that the results will be unstable as the underlying set changes. Pagination may either miss or double-include items unpredictably.

2. Store an intermediate result set guaranteed to remain stable for the necessary duration. This provides stable pagination, at the cost of taking on cache-expiry problems (sketched below).

3. Use or build a version-controlled data store. I don't know of anything in common use, but there is likely something available. This is similar to #2, but moves the work from the application into the data-storage layer. You then paginate against a set version of the data. Imagine something similar to Immutable, but with expiry for unreachable nodes.

Unsure what #3 looks like at scale.
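
A minimal sketch of option #2 (Python/SQLite; schema and names invented): materialize the matching ids into a snapshot table at query time, then paginate against the snapshot while the live table churns. Expiring old snapshots is the cache-invalidation cost mentioned above.

  import sqlite3, time, uuid

  conn = sqlite3.connect(":memory:")
  conn.executescript("""
      CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
      CREATE TABLE snapshots (token TEXT, pos INTEGER, item_id INTEGER,
                              created_at REAL);
  """)
  conn.executemany("INSERT INTO items VALUES (?, ?)",
                   [(i, f"item-{i}") for i in range(1, 101)])

  def create_snapshot(conn):
      token = uuid.uuid4().hex
      conn.execute(
          "INSERT INTO snapshots "
          "SELECT ?, ROW_NUMBER() OVER (ORDER BY id), id, ? FROM items",
          (token, time.time()))
      return token

  def fetch_page(conn, token, page, size=10):
      return conn.execute(
          "SELECT i.id, i.name FROM snapshots s "
          "JOIN items i ON i.id = s.item_id "
          "WHERE s.token = ? AND s.pos > ? ORDER BY s.pos LIMIT ?",
          (token, page * size, size)).fetchall()

  tok = create_snapshot(conn)
  print(fetch_page(conn, tok, page=1))  # rows 11-20; later inserts can't shift the page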


Or... search, not page, which is typically way more useful in reality.


I consider offset-based pagination to be the mark of the novice.


Or the mark of someone who asked the users what they wanted.


I don't disagree, but everyone would prefer an API that works over one that doesn't, or one that causes service outages due to database latency. (Did anyone ask the users if they wanted the API to work?)

If you're dealing with very small datasets, it's fine. But I'm an average person using average APIs, which means that when I see offset-based pagination, it's usually on a service deployed and used by a lot of people.

Unsurprisingly, the offset-based APIs often include some other arbitrary limit like "offset limited to 10k" or something similarly silly, but understandable if you've built an API used by thousands of people before, or understand how databases work.

They're also often superseded by better APIs that actually allow you to page through the entire result set. Then you have a deprecated API that you either support forever or annoy users by turning off.

So yes, if you are building something that isn't internal or a pet project, limit/offset is probably the mark of the novice.


edit: I just saw another comment of yours, so I see this was meant more contextually than I realized.

Can you explain why offsets would never be a suitable solution? Is there a clear-cut reason?

I understand how cursors are superior in some cases. But what if you have a site which paginates static data? Offsets would allow you to cache the results easily, overcoming any performance concerns (which would be irrelevant if the dataset was small anyway), providing a better user experience (and developer experience due to simpler implementation details) overall.

I can see that it would be a novice move if you’ve got staggering amounts of records that take a while to query. But that’s actually pretty rare in my experience.


It doesn't even have to be a cursor; you can technically page off of any field you can index (I've written a lot of sync APIs, so there are gotchas, but the point stands).

Limit/offset is usually (though not always, as you point out) egregious for anything more than hobbies or small projects. But I am biased: when I build APIs, I expect some amount of traffic/volume/tuple count, and offset/limit will not do.


Cool, I think we mostly agree. Thanks for the explanation!



