I want to extract an incremental update from a MySQL table. The table in question has an auto-incremented ID field, and an updated_at field that is updated to the current timestamp on every change (INSERT/UPDATE). Records are never deleted from this table.
I would like to extract all newly created and updated records from this table since the last run of the script. I know the highest updated_at value from the last run. So the main query to extract records will be something like:
SELECT * FROM table WHERE updated_at >= :last_seen_updated_at
What is the best way of achieving this? I have to be 100% sure that all new and updated records will be extracted. Some questions and concerns:
updated_at >= :last_seen_updated_at - INTERVAL 1 MINUTE
)This isn't a complete answer, but it is a solid way to avoid duplicates. First, when you run the update script, don't run it for the current second. Run it at everything that's more than, say, 5 seconds old. That way, you know that if you have one record for a given second, you have all records for that second. Having done that, you should be able to use updated_at > :last_seen_updated_at
and avoid duplicates between update cycles.
At that point, you should feel free to extract in batches or one large query. At least as you've described the table, the data that you'll be extracting will at that point be largely static - there may well be other entries added to the table as you go, but they'll all be after whatever your arbitrary recent cut-off second is, and so won't be a part of your query.
With respect to the slave/master issue, that might be a bit of a concern, but as long as the master updates older records first, it's still pretty easy to solve. When setting your update cutoff, find the most recent timehack on a local record and subtract one second.
There is still the concern where an entry that was added or updated in the time section you were paying attention to might be updated again while you were extracting. The only way to strictly prevent that is to disconnect from the server and buffer changes while extracting, but you can cut down on the incidence significantly by going back and running the script again after you're done for the relatively small number of adds and modifies that occurred while you were extracting the larger group. You can repeat this as many times as you feel paranoia warrants. Alternately, you can count the number of rows pre-extraction, extract, and count the number extracted. If there's any difference, you can run the process again, until you get the same number before and after.