I'm converting scanned documents to text using a Python-based AI OCR tool. I have to process 200k files, but after around 25k files were processed my OS killed the Python script because of OOM. Now I want to run the script again but exclude all the files I already processed. An example of the code I use to find the files is below:
import os
import sys
from pathlib import Path
import itertools
companyfolder = sys.argv[1]
companypath = ("/home/user/download/" + companyfolder)
outputpath = ("/home/user/output/" + companyfolder + "/OCR")
errorpath = ("/home/user/output/" + companyfolder)
# run OCR loop
for file in itertools.chain(
Path(companypath).rglob("*.jpeg"),
Path(companypath).rglob("*.JPEG"),
Path(companypath).rglob("*.jpg"),
Path(companypath).rglob("*.JPG"),
Path(companypath).rglob("*.png"),
Path(companypath).rglob("*.PNG"),
Path(companypath).rglob("*.tif"),
Path(companypath).rglob("*.TIF"),
Path(companypath).rglob("*.tiff"),
Path(companypath).rglob("*.TIFF"),
Path(companypath).rglob("*.bmp"),
Path(companypath).rglob("*.BMP"),
Path(companypath).rglob("*.pdf"),
Path(companypath).rglob("*.PDF"),
):
    try:
        # make dirs and file path
        print(file)
        # ... more commands here and below ...
I have a list of the files I already processed, and I want to exclude them from the globbing so they are not processed again. To make the paths comparable I removed the suffix, because my input and output suffixes differ. A few example entries from the exclusion list:
/home/user/output/Google/OCR/Desktop/Desktop/Scans/james/processed-450ce329-1f42-4e13-bf0f-db9e2ee33103_WGNv7mVl
/home/user/output/Google/OCR/Desktop/Desktop/Scans/james/processed-db89fa50-7bba-4cf4-b898-c6839e2294be_vbsnLO4H
/home/user/output/Google/OCR/Desktop/Desktop/Scans/james/Screenshot_20220826-142620_Office
/home/user/output/Google/OCR/Desktop/Desktop/Scans/2022-04-25 11_07_09-Window
/home/user/output/Google/OCR/Desktop/Desktop/SCANS SPENDINGS - INCOMING INVOICES/Q2 2022/january/list [Q2 2022]
/home/user/output/Google/OCR/Desktop/Desktop/Fotoshoot office/IMG_5736
/home/user/output/Google/OCR/Desktop/Desktop/Fotoshoot office/IMG_5957
/home/user/output/Google/OCR/Desktop/Desktop/Fotoshoot office/IMG_5761
I hope someone can teach me how to do this.
You are actually pretty much there, given that you already know which files were processed.
All you really need to do is put the processed files into a collection (list, dict, or set — a set is the best fit here, since membership tests are constant time) and then check each candidate file against it. For example:
# with a set (to find the files that still need processing)
processed_files = {'file1.ext', 'file2.ext'}
set_of_files = {'file1.ext', 'file2.ext', 'file3.ext'}
for file in set_of_files:
    if file in processed_files:
        print(f'{file} does not need to be processed')
    else:
        print(f'{file} needs to be processed')
which would produce this (a set's iteration order is arbitrary):
file2.ext does not need to be processed
file1.ext does not need to be processed
file3.ext needs to be processed
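Applied to your case, you can build the set from your exclusion list, keyed on each path relative to the OCR output root. A minimal sketch, assuming the list lives in a text file with one path per line (processed_list.txt is a hypothetical name) and reusing the outputpath from your script:

from pathlib import Path

outputpath = "/home/user/output/Google/OCR"  # as built in your script
processed_files = set()
with open("processed_list.txt") as f:
    for line in f:
        line = line.strip()
        if line:
            # key e.g. Path("Desktop/Desktop/Fotoshoot office/IMG_5736")
            processed_files.add(Path(line).relative_to(outputpath))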
Alternatively, with a dict, if you also track each file's status:
# with a dict (if you also know which files failed)
dict_of_files = {
    'file1.ext': 'processed',
    'file2.ext': 'processed',
    'file3.ext': 'failed',
}
for k, v in dict_of_files.items():
    if v == 'failed':
        print(f'process file {k}')
    else:
        print(f'{k} was already processed')
which would produce this:
file1.ext was already processed
file2.ext was already processed
process file file3.ext
From there, you would invoke the processing routine from within the if statement. If a percentage of those files failed, you could repeat the process as necessary.
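Putting it together with your globbing loop, a minimal sketch (assuming, as your example list suggests, that the output tree under .../OCR mirrors the input tree, so an input file's key is its path relative to companypath with the suffix dropped):

for file in itertools.chain(
    Path(companypath).rglob("*.jpeg"),
    # ... the other patterns from your script ...
):
    key = file.relative_to(companypath).with_suffix("")
    if key in processed_files:
        continue  # already processed, skip it
    # make dirs and run OCR as before
    print(file)

Because processed_files is a set, the membership check stays fast even across 200k files.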