Tags: c#, mysql, paging, data-paging

Fetching large amounts of data using paging


Let's say, for instance, that I have a Cloud environment and a Client environment, and I want to sync a large amount of data from the cloud to the client. Let's say I have a DB table in the cloud named Files, and I want an identical table to exist in the client environment.

Now let's assume a few things:

  1. The Files table is very big.
  2. The data in each row of Files can be updated at any time, and each row has a last-update column.
  3. I want to fetch the deltas and make sure the table is identical in both environments (a minimal schema sketch follows this list).
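
To make these assumptions concrete, here is a minimal sketch of how such a table might look. The column names and types are my own illustration, not the real schema; only the id, the last-update column, and its index (recommended in the answer below) are taken from the post:

CREATE TABLE files (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    col1        VARCHAR(255),        -- payload columns; whatever the real table holds
    col2        VARCHAR(255),
    last_update DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
                         ON UPDATE CURRENT_TIMESTAMP,
    PRIMARY KEY (id),
    INDEX (last_update)              -- needed for the delta queries below
) ENGINE=InnoDB;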

My solution:

  1. I do a full sync first, returning all the entries to the client.
  2. I keep the LastSync time in the client environment and keep syncing deltas from that LastSync time onward.
  3. I do both the full sync and the delta syncs using paging: the client fires a first request to get the Count of results for the delta, then as many further requests as needed, based on the Page Size of each request.

For example, the count:

SELECT COUNT(*) FROM files WHERE last_update > @LastSyncTime

The page fetching:

SELECT col1, col2..
FROM files 
WHERE last_update > @LastSyncTime
ORDER BY files.id
LIMIT @LIMIT 
OFFSET @OFFSET
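
For concreteness (the numbers here are made up for illustration): if the Count query returns 250 rows and the Page Size is 100, the client issues ceil(250 / 100) = 3 page requests:

SELECT col1, col2 FROM files WHERE last_update > @LastSyncTime ORDER BY files.id LIMIT 100 OFFSET 0;   -- page 1
SELECT col1, col2 FROM files WHERE last_update > @LastSyncTime ORDER BY files.id LIMIT 100 OFFSET 100; -- page 2
SELECT col1, col2 FROM files WHERE last_update > @LastSyncTime ORDER BY files.id LIMIT 100 OFFSET 200; -- page 3 (returns the remaining 50 rows)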

My problem:

What if, for example, the first fetch (the Count fetch) takes some time (a few minutes, say), and during that time more entries are updated so that they now fall within the last-update window?

For example:

  • The Count fetch returns 100 entries for a last-update window of 1000 seconds.
  • 1 entry is updated while the Count is being fetched.
  • The same 1000-second last-update window now contains 101 entries.
  • The page fetches, ordered by id, will only get 100 of the 101 entries.
  • 1 entry is missed and never synced to the client.

I have tried two other options:

  • Syncing with a from-to date range on last-update.
  • Ordering by last-update instead of by the id column.

I see issues in both options.


Solution

    • Do not use OFFSET and LIMIT; as the offset grows, it goes from OK to slow to slower. Instead, keep track of "where you left off" with last_update so that paging stays efficient (see the comparison sketch after this list). More Discussion

    • Since there can be duplicate datetime values, be flexible about how many rows to handle at once.

    • Run this continually. Don't use cron except as a 'keep-alive'.

    • There is no need for the initial copy; this code does it for you.

    • It is vital to have INDEX(last_update).
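
    To illustrate the first and last bullets, here is a rough side-by-side sketch (column and table names are carried over from the question; the ALTER TABLE assumes the index does not exist yet):

    -- OFFSET paging: MySQL must read and discard every skipped row,
    -- so page N costs roughly N * page_size row reads.
    SELECT col1, col2
        FROM files
        WHERE last_update > @LastSyncTime
        ORDER BY id
        LIMIT 100 OFFSET 100000;   -- touches ~100100 rows to return 100

    -- "Where you left off" paging: start after the last last_update already sent.
    -- (The full code below also handles duplicate last_update values via a cutoff.)
    SELECT col1, col2
        FROM files
        WHERE last_update > @left_off
        ORDER BY last_update
        LIMIT 100;                 -- jumps straight to the right index position

    -- The index that makes this cheap:
    ALTER TABLE files ADD INDEX(last_update);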

    Here is the code:

    -- Initialize.  Note: This subtract is consistent with the later compare. 
    SELECT @left_off := MIN(last_update) - INTERVAL 1 DAY
        FROM tbl;
    
    Loop:
    
        -- Get the ending timestamp:
        SELECT @cutoff := last_update FROM tbl
             WHERE last_update > @left_off
             ORDER BY last_update
             LIMIT 1  OFFSET 100;   -- assuming you decide to do 100 at a time
        -- if no result, sleep for a while, then restart
    
        -- Get all the rows through that timestamp
        -- This might be more than 100 rows
        SELECT * FROM tbl
            WHERE last_update > @left_off
              AND last_update <= @cutoff
        ORDER BY last_update;
        -- and transfer them
    
        -- prep for next iteration
        SET @left_off := @cutoff;
    
    Goto Loop
    

    SELECT @cutoff will be fast -- it is a brief scan of 100 consecutive rows in the index.

    SELECT * does the heavy lifting, and takes time proportional to the number of rows -- no extra overhead for OFFSET. 100 rows should take about 1 second for the read (assuming spinning disk, non-cached data).

    Instead of initially getting COUNT(*), I would start by getting MAX(last_update), since the rest of the code is based on last_update. This query is "instantaneous" since it only has to probe the end of the index. But I claim you don't even need that!
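
    As a small sketch of that point, reusing the question's table name:

    -- COUNT(*) must scan every index entry in the requested range:
    SELECT COUNT(*) FROM files WHERE last_update > @LastSyncTime;

    -- MAX(last_update) only probes the last entry of INDEX(last_update):
    SELECT MAX(last_update) FROM files;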

    A possible bug: if rows in the 'source' can be deleted, how do you recognize that?