I'm creating a hypermedia driven RESTful API that will be used to query transactional data. The intention is that the results will be paginated.
Each API call will query an indexed database table. Since I don't want to keep the results server side due to memory considerations, I was thinking to retrieve the data based on rownum, dependent upon which page is requested. E.G. on page one, WHERE rownum <= 10
, on page two, WHERE rownum BETWEEN 11 AND 20
etc.
However, the database in question is replicated from a production system and could potentially add records into an area of the result set already requested. E.G. page one is requested -> 10 rows are returned -> a transaction is inserted at row 5. Now page two will include a record already displayed on page one, as the results are essentially pushed up by a rownum.
What would be a good way of achieving my objective of creating a hypermedia driven RESTful API that provides paginated transactional data from a database, without holding on to the result sets for the duration of the session?
This is a pretty common problem and there are actually not many approaches. I can think of only three, actually:
You don't care and the result will change. This is the behaviour of stackoverflow: if you're on page 2 of the questions page and someone posts a new question, when clicking on page 3 you may get one or more of the questions that were already listed on page 2, because the index has shifted.
If you don't want to keep in memory the actual data, you're in for a lot of trouble. You could store the handler for the result set, instead of the results themselves, and loop over it fetching the number of rows that you actually need. E.g. you run the select, fetch 10 rows and store the handler of the resultset. Together with the rows, you return to the client a unique ID of the query. The problem will be when you have a range specified, because you can't really "rewind" a database cursor, and that would mean caching the results, which you may want to do anyway. But if you do it like that, sooner or later you're going to have all of the results in memory anyway.
You could still use some memory, but keep only some unique identifier of the rows, associated with a unique identifier of the query, as above. This could work, but only if the rows may be added, and not deleted or updated (if they're updated, they may not match the query any more).
Personally, I'd go with option 1.