I have a script that runs a function on every item in my database to extract academic citations. The database is large, so the script takes about a week to run.
During that time, items are added to and removed from the database.
The database is too large to pull completely into memory, so I have to chunk through it to process all the items.
Is there a way to ensure that when the script finishes, all of the items have been processed? Is this a pattern with a simple solution? So far my research hasn't revealed anything useful.
PS: Locking the table for a week isn't an option!
I would add a timestamp column `modified_at` to the table that defaults to null, so any item that has not been processed yet can be identified. Your script can then pick the chunks to work on based on that column.
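A minimal sketch of the schema change, assuming the table is called `items` as in the statement below; the partial index is an optional extra (not part of the suggestion itself) that keeps the lookup for unprocessed rows cheap on a large table:

```sql
alter table items
  add column modified_at timestamptz;  -- no default, so it starts out null everywhere

-- optional: a partial index so finding unprocessed rows stays cheap on a large table
create index items_unprocessed_idx
  on items (id)
  where modified_at is null;
```

Claiming a chunk is then a single statement: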
```sql
update items
   set modified_at = current_timestamp
  from (
    select id
      from items
     where modified_at is null
     limit 1000  -- this defines the size of each "chunk" that you work on
  ) t
 where t.id = items.id
returning items.*;
```
This marks 1000 rows that have not been processed as processed and returns those rows, all in one single statement. Your job can then work on the returned rows.
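If the extraction itself could run inside the database, the whole run could be sketched as a PL/pgSQL loop like the one below; `extract_citations` is a hypothetical function standing in for your processing, and `commit` inside a `do` block requires PostgreSQL 11 or later. If the processing has to stay in your script, the loop has the same shape there: claim a chunk, process it, commit, and stop once the claim statement returns no rows.

```sql
do $$
declare
  item    items%rowtype;
  claimed integer;
begin
  loop
    claimed := 0;
    -- claim the next chunk and walk over the returned rows
    for item in
      update items
         set modified_at = current_timestamp
        from (
          select id
            from items
           where modified_at is null
           limit 1000
        ) t
       where t.id = items.id
      returning items.*
    loop
      perform extract_citations(item);  -- hypothetical processing function
      claimed := claimed + 1;
    end loop;

    exit when claimed = 0;  -- nothing left to process
    commit;                 -- PostgreSQL 11+: finish this chunk before claiming the next
  end loop;
end
$$;
```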
New rows need to be added with `modified_at = null`, and your script will pick them up based on the `where modified_at is null` condition the next time you run it.
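For illustration, a plain insert is enough, because the column default already leaves `modified_at` null (the column list here is hypothetical):

```sql
-- modified_at is simply left out, so it stays null
-- and the next run will pick this row up
insert into items (id, content)
values (42, 'new item text');
```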
If you also change items while processing them, you need to update `modified_at` accordingly. Your script will then need to store the timestamp of its last start somewhere; the next run can select items to be processed using `where modified_at is null or modified_at < (last script start time)`.
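Spelled out as a claim statement, that condition could look like this; `:last_script_start` is a placeholder for the value your script stored:

```sql
update items
   set modified_at = current_timestamp
  from (
    select id
      from items
     where modified_at is null
        or modified_at < :last_script_start  -- placeholder your script fills in
     limit 1000
  ) t
 where t.id = items.id
returning items.*;
```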
If you process each item only once (and then never again), you don't really need a timestamp; a simple boolean column (e.g. `is_processed`) would do as well.
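A minimal sketch of that variant:

```sql
alter table items
  add column is_processed boolean not null default false;

update items
   set is_processed = true
  from (
    select id
      from items
     where not is_processed
     limit 1000
  ) t
 where t.id = items.id
returning items.*;
```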