Let's say, for instance, that I have a Cloud environment and a Client environment, and I want to sync a large amount of data from the cloud to the client. Let's say I have a DB table in the cloud named Files, and I want an identical table to exist in the client environment.
Now let's assume a few things: the table holds a large amount of data, and each record has a last-update column.

My solution: keep a LastSync time in the client environment and keep syncing deltas from that LastSync time, in pages. First fetch the Count of results for the delta, then issue as many further requests as the Count and the Page Size of each request require. For example, the count:
SELECT COUNT(*) FROM files WHERE last_update > @LastSyncTime
The page fetching:
SELECT col1, col2..
FROM files
WHERE last_update > @LastSyncTime
ORDER BY files.id
LIMIT @LIMIT
OFFSET @OFFSET
My problem:
What if, for example, the first fetch (the Count fetch) takes some time (a few minutes, say), and in this time more entries are updated and added, so they now fall within the last-update range of the fetch?
For example: the delta covers every entry with a last-update in the last 1000 seconds, and the Count comes back as 100. While the pages are being fetched, another entry gets a last-update within those 1000 seconds, so the same last-update filter will now give 101 entries; and since the pages are ordered by the id column, the extra entry shifts rows across page boundaries, so some can be missed or fetched twice.
I have tried 2 other options: setting a from-to date limit for last-update, and ordering by last-update instead of the id column. I see issues in both options.
Do not use OFFSET and LIMIT; it goes from OK to slow to slower. Instead, keep track of "where you left off" with last_update so that it can be more efficient. More Discussion
Since there can be dups of the datetime, be flexible about how many rows to do at once.
Run this continually. Don't use cron except as a 'keep-alive'.
There is no need for the initial copy; this code does it for you.
It is vital to have INDEX(last_update).
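If the table doesn't already have that index, it can be added like this (a sketch using the placeholder table name tbl from the code below; the index name is just illustrative):

ALTER TABLE tbl ADD INDEX idx_last_update (last_update);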
Here is the code:
-- Initialize. Note: This subtract is consistent with the later compare.
SELECT @left_off := MIN(last_update) - INTERVAL 1 DAY
    FROM tbl;

Loop:

    -- Get the ending timestamp:
    SELECT @cutoff := last_update FROM tbl
        WHERE last_update > @left_off
        ORDER BY last_update
        LIMIT 1 OFFSET 100;    -- assuming you decide to do 100 at a time
    -- if no result, sleep for a while, then restart

    -- Get all the rows through that timestamp
    -- This might be more than 100 rows
    SELECT * FROM tbl
        WHERE last_update > @left_off
          AND last_update <= @cutoff
        ORDER BY last_update;
    -- and transfer them

    -- prep for next iteration
    SET @left_off := @cutoff;

Goto Loop
SELECT @cutoff will be fast -- it is a brief scan of 100 consecutive rows in the index.

SELECT * does the heavy lifting, and takes time proportional to the number of rows -- no extra overhead for OFFSET. 100 rows should take about 1 second for the read (assuming spinning disk, non-cached data).
Instead of initially getting COUNT(*), I would start by getting MAX(last_update), since the rest of the code is based on last_update. This query is "instantaneous" since it only has to probe the end of the index. But I claim you don't even need that!
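A minimal sketch of that probe (the @max_at_start variable name is mine, not from the answer); on my reading, it could serve as a fixed upper bound for the delta so that rows updated mid-sync don't shift the pages:

SELECT @max_at_start := MAX(last_update) FROM tbl;   -- probes the end of INDEX(last_update)
-- pages could then be bounded with: last_update > @LastSyncTime AND last_update <= @max_at_start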
A possible bug: If rows in the 'source' can be deleted, how do you recognize that?