Given the success of the question I posted here, Find empty files and their duplicates (thank you, Freeman and Mark Setchell), I am now encouraged to ask a related question. This time, the challenge is to delete a file if it lacks a partner.
The text2image tool in Tesseract sometimes fails to produce .box files at all.
The files are supposed to appear as triplets, as follows:
But when the tool fails to produce the box file, I get only the two partner files, as follows:
What I want is to delete the .gt.txt and .tif files that lack a .box partner.
I hope the description is clear.
Is this what you want?
#!/bin/bash
# Create an associative array (bash 4+) to hold the base names of the box files
declare -A box_files
# Record the base name of every .box file in the current directory
for file in *.box; do
    # Skip the literal pattern if no .box file exists
    [[ -e "$file" ]] || continue
    # Strip the .box extension to get the base name
    file_name="${file%.box}"
    # Mark this base name as having a box file
    box_files["$file_name"]=1
done
# Iterate over the .tif and .gt.txt files in the directory
for file in *.tif *.gt.txt; do
    # Skip the literal pattern if nothing matched
    [[ -e "$file" ]] || continue
    # Strip the whole extension. This assumes names like NAME.tif, NAME.gt.txt
    # and NAME.box, so .gt.txt has to be removed as one piece
    case "$file" in
        *.gt.txt) file_name="${file%.gt.txt}" ;;
        *)        file_name="${file%.tif}" ;;
    esac
    # Delete the file if no box file shares its base name
    if [[ -z "${box_files[$file_name]:-}" ]]; then
        rm -- "$file"
        echo "Deleted $file"
    fi
done
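If you would rather preview what gets removed first, the same check can be done without the array by testing for the box file directly. This is only a sketch: it assumes the triplets are named NAME.tif, NAME.gt.txt and NAME.box and sit in one directory, and the function name prune_orphans and its -n dry-run switch are made up for illustration.

```shell
#!/bin/bash
# prune_orphans [-n] [DIR]
# Hypothetical helper: removes the .tif and .gt.txt files in DIR (default: .)
# that have no NAME.box next to them. With -n it only reports what it would do.
prune_orphans() {
    local dry_run=0
    if [ "$1" = "-n" ]; then
        dry_run=1
        shift
    fi
    local dir="${1:-.}" file base
    for file in "$dir"/*.tif "$dir"/*.gt.txt; do
        # Skip the literal pattern when nothing matched
        [ -e "$file" ] || continue
        # Remove the full extension, including the double .gt.txt
        case "$file" in
            *.gt.txt) base="${file%.gt.txt}" ;;
            *)        base="${file%.tif}" ;;
        esac
        # Keep the file if its box partner exists
        [ -e "$base.box" ] && continue
        if [ "$dry_run" -eq 1 ]; then
            echo "Would delete $file"
        else
            rm -- "$file" && echo "Deleted $file"
        fi
    done
}
```

Run prune_orphans -n to check the list, then prune_orphans (optionally with a directory) to actually delete.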