Delete a file if it lacks a duplicate (partner)


Given the success of my earlier question, "Find empty files and their duplicates" (thank you, Freeman and Mark Setchell), I am encouraged to ask a related one. This time, the challenge is to delete a file if it lacks a partner.

The text2image tool in Tesseract sometimes fails to produce .box files at all.

The files are supposed to appear as triplets, as follows:

  • File1.box
  • File1.gt.txt
  • File1.tif
  • File2.box
  • File2.gt.txt
  • File2.tif

But when the tool fails to produce the .box file, I get only the two partner files:

  • File3.gt.txt
  • File3.tif

What I want is to delete the .gt.txt and .tif files that lack a .box partner.

I hope the description is clear.


Solution

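  • Is this what you want? The script records the base name of every .box file, then deletes each .tif or .gt.txt file whose base name was not recorded.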

    #!/bin/bash
    
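    # Make unmatched globs expand to nothing instead of the literal pattern
    shopt -s nullglob
    
    # Associative array keyed by the base name of each .box file
    declare -A box_files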
    
    #iterate over the files in the directory and populate the box_files array
    for file in *.box; do
        # Get the file name without extension
        file_name="${file%.*}"
        # Add the box file name to the array
        box_files["$file_name"]=1
    done
    
    # Iterate over the .tif and .gt.txt files in the directory
    for file in *.{tif,gt.txt}; do
        # Strip the whole suffix: "${file%.*}" would reduce File1.gt.txt
        # to File1.gt, which never matches a box file's base name
        case "$file" in
            *.gt.txt) file_name="${file%.gt.txt}" ;;
            *)        file_name="${file%.tif}" ;;
        esac
    
        #check if the corresponding box file exists in the array
        if [ "${box_files["$file_name"]}" != 1 ]; then
            #delete the files without a partner
            rm -- "$file"
            echo "Deleted $file"
        fi
    done
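
As an alternative to the array, each candidate can be checked against the file system directly. The following is only a minimal sketch under a few assumptions: `delete_orphans` is a hypothetical function name, the directory argument defaults to the current one, and the optional `-n` second argument (a dry run that prints what would be removed without deleting anything) is an addition made here for safety, not part of the original script.

```shell
#!/bin/bash
# delete_orphans DIR [-n]
# Remove the .tif and .gt.txt files in DIR that have no .box partner.
# With -n as the second argument, only report what would be removed.
# The body runs in a subshell, so its variables do not leak to the caller.
delete_orphans() (
    dir="${1:-.}"
    dry_run="${2:-}"

    for file in "$dir"/*.tif "$dir"/*.gt.txt; do
        # An unmatched glob is left as the literal pattern; skip it
        [ -e "$file" ] || continue

        # Strip the full suffix so File3.gt.txt and File3.tif both give File3
        case "$file" in
            *.gt.txt) base="${file%.gt.txt}" ;;
            *)        base="${file%.tif}" ;;
        esac

        # Keep the file when its box partner exists
        [ -e "$base.box" ] && continue

        if [ "$dry_run" = "-n" ]; then
            echo "Would delete $file"
        else
            rm -- "$file" && echo "Deleted $file"
        fi
    done
)
```

Running `delete_orphans . -n` first lists the files that would be removed; `delete_orphans .` then removes them.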