Exclude a list of files/patterns from pathlib glob


I'm converting scanned documents to text using a Python-based AI tool. I have to process 200k files, but after around 25k files my OS killed the Python script because it ran out of memory (OOM). Now I want to run the script again but exclude all the files I have already processed. An example of the code I wrote to find the files is below:

import os
import sys
from pathlib import Path
import itertools

companyfolder = sys.argv[1]
companypath = ("/home/user/download/" + companyfolder)
outputpath = ("/home/user/output/" + companyfolder + "/OCR")
errorpath = ("/home/user/output/" + companyfolder)

# run OCR loop
for file in itertools.chain(
    Path(companypath).rglob("*.jpeg"),
    Path(companypath).rglob("*.JPEG"),
    Path(companypath).rglob("*.jpg"),
    Path(companypath).rglob("*.JPG"),
    Path(companypath).rglob("*.png"),
    Path(companypath).rglob("*.PNG"),
    Path(companypath).rglob("*.tif"),
    Path(companypath).rglob("*.TIF"),
    Path(companypath).rglob("*.tiff"),
    Path(companypath).rglob("*.TIFF"),
    Path(companypath).rglob("*.bmp"),
    Path(companypath).rglob("*.BMP"),
    Path(companypath).rglob("*.pdf"),
    Path(companypath).rglob("*.PDF"),
):
    try:
        # make dirs and file path
        print(file)
        # ... more commands here and below

I have a list of the files I already processed, and I want to exclude them from the globbing so they are not processed again. To make matching possible I removed the suffix from each entry, because my input and output suffixes differ. A few entries from the list of files I want to exclude:

/home/user/output/Google/OCR/Desktop/Desktop/Scans/james/processed-450ce329-1f42-4e13-bf0f-db9e2ee33103_WGNv7mVl
/home/user/output/Google/OCR/Desktop/Desktop/Scans/james/processed-db89fa50-7bba-4cf4-b898-c6839e2294be_vbsnLO4H
/home/user/output/Google/OCR/Desktop/Desktop/Scans/james/Screenshot_20220826-142620_Office
/home/user/output/Google/OCR/Desktop/Desktop/Scans/2022-04-25 11_07_09-Window
/home/user/output/Google/OCR/Desktop/Desktop/SCANS SPENDINGS - INCOMING INVOICES/Q2 2022/january/list [Q2 2022]
/home/user/output/Google/OCR/Desktop/Desktop/Fotoshoot office/IMG_5736
/home/user/output/Google/OCR/Desktop/Desktop/Fotoshoot office/IMG_5957
/home/user/output/Google/OCR/Desktop/Desktop/Fotoshoot office/IMG_5761

I hope someone can show me how to do this.


Solution

  • You are pretty much there, given that you already know which files were processed.

    All you really need to do is put the list of processed files into a collection and then check each file against it before processing. A set or dict gives fast membership tests; a list also works, but every lookup scans the whole list.

    For example:

    # with a set (for the remaining files)
    processed_files = {'file1.ext', 'file2.ext'}
    set_of_files = {'file1.ext', 'file2.ext', 'file3.ext'}
    
    for file in set_of_files:
        if file in processed_files:
            print(f' {file} does not need to be processed')
        else:
            print(f' {file} needs to be processed')
    

    which would produce something like this (sets are unordered, so the line order may differ):

     file2.ext does not need to be processed
     file1.ext does not need to be processed
     file3.ext needs to be processed
    

    or this:

    # with a dict (if you know which are processed)
    dict_of_files = {
        'file1.ext':'processed', 
        'file2.ext':'processed', 
        'file3.ext':'failed'
        }
    
    for k, v in dict_of_files.items():
        if v == 'failed':
            print(f'process file {k}')
        else:
            print(f'{k} was already processed')
    

    which would produce this:

    file1.ext was already processed
    file2.ext was already processed
    process file file3.ext
    

    From there, you would invoke the processing routine from the branch that handles unprocessed files (the else branch in the set example, the if branch in the dict example).

    If some of those fail, you can repeat the process for just the failed files.
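
Applied to the paths in the question, the same membership check could look roughly like the sketch below. It assumes the output tree mirrors the input tree (so a path relative to `/home/user/output/<company>/OCR` equals the input path relative to `/home/user/download/<company>` minus its suffix) and that the processed list is available as lines of text. The names `load_processed_keys`, `iter_unprocessed` and `WANTED_SUFFIXES` are made up for this example. One caveat: two inputs that differ only by suffix, such as `a.jpg` and `a.png`, map to the same key, so both would be skipped once either is processed.

```python
from pathlib import Path

# Suffixes from the question; compared in lower case so "*.JPG" and "*.jpg" both match.
WANTED_SUFFIXES = {".jpeg", ".jpg", ".png", ".tif", ".tiff", ".bmp", ".pdf"}


def load_processed_keys(lines, output_root):
    """Turn suffix-less output paths into keys relative to output_root."""
    keys = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if line:
            # The suffix was already removed from these entries, so keep the name as-is.
            keys.add(Path(line).relative_to(output_root).as_posix())
    return keys


def iter_unprocessed(input_root, processed_keys):
    """Yield input files whose suffix-less relative path is not in processed_keys."""
    input_root = Path(input_root)
    for file in input_root.rglob("*"):
        if not file.is_file() or file.suffix.lower() not in WANTED_SUFFIXES:
            continue
        # Drop only the final suffix so the key lines up with the processed list.
        key = file.relative_to(input_root).with_suffix("").as_posix()
        if key not in processed_keys:
            yield file


# Hypothetical usage; "processed.txt" is an assumed file holding the list of processed paths:
# with open("processed.txt") as fh:
#     done = load_processed_keys(fh, "/home/user/output/Google/OCR")
# for file in iter_unprocessed("/home/user/download/Google", done):
#     print(file)  # run the OCR step here instead
```

A single `rglob("*")` with a suffix check also replaces the fourteen chained `rglob` calls, so the directory tree is walked once instead of once per pattern.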