
Find and execute line by line before final result in bash


I'm trying to get rid of old asset versions. The file naming is strictly as follows:
<timestamp>_<constant><version><assetID>.zip<.extra>
for example, 202201012359_FOOBAR0101234567.zip.done.
<timestamp> is the date and time when the file was added to the folder.
<constant> does not change inside the folder being handled.
<version> is a two-digit number starting from 00 that identifies the version of the asset with the given <assetID>.
<extra> is optional, so the extension can be .zip, .zip.done, or .zip.somethingelse.
However, an asset may have files with all three extensions, and each of them can exist multiple times with different timestamps. In other words, several files can share the same ID and version number and differ only in their timestamp.
The goal is to find the latest version of each asset ID and remove the older versions. Only the version number matters, not the timestamp.
All assets are located in a single folder with no subfolders.
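
To make the naming scheme concrete, here is a minimal sketch that splits one file name into its fields with a bash regular expression. The function name parse_asset is made up for illustration, and the 8-digit asset ID is an assumption taken from the examples, not a rule stated above.

```shell
#!/bin/bash
# Sketch: split an asset file name into its fields.
# Assumptions (from the examples, not guaranteed): the timestamp is all
# digits, the asset ID is exactly 8 digits, and the 2-digit version sits
# directly in front of it, right before ".zip".
parse_asset() {
    local name=${1##*/}     # drop any leading directory part
    local re='^([0-9]+)_(.*)([0-9]{2})([0-9]{8})\.zip(\..*)?$'
    if [[ $name =~ $re ]]; then
        printf 'timestamp=%s constant=%s version=%s assetId=%s extra=%s\n' \
            "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}" \
            "${BASH_REMATCH[4]}" "${BASH_REMATCH[5]}"
    else
        return 1            # name does not follow the scheme
    fi
}
```

Calling parse_asset with the example name from above should report version 01 and asset ID 01234567; a name that does not follow the scheme makes the function return a non-zero status.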

Current solution

So far, this is how the goal can be reached:

#!/bin/bash
location="/home/user/FOOBAR"
echo "Deleting older files..."

# Declare variable to print the outcome of removed asset ID's
declare -A assetsRemoved

# The main loop which finds all the files in the folder
find $location -maxdepth 1 -type f -name "*.zip*" -a -name "*FOOBAR*" | while read line; do
    # <timestamp>_FOOBAR<iterator><assetId><file-extensions>
    # 20201229104919_FOOBAR0300040682.zip.done

    # Separate assetId
    rest=${line#*'.zip'}
    # .done
    pos=$(( ${#line} - ${#rest} - 4 ))
    # 20201229104919_FOOBAR0300040682<^>.zip.done
    assetId=${line:pos-8:8}
    # 20201229104919_FOOBAR03<00040682>.zip.done

    # Find all files with same assetId
    assets="$(find ~+ $location -maxdepth 1 -type f -name "*$assetId.zip*" -a -name "*FOOBAR*")"
    # Init loop variables
    max=-1
    mostRecent=""
    cleanedOld=0

    # Loop all files with same assetId
    for file in $assets
    do
        # Separate basename without extension
        basenameNoExt="${file%%.*}"
        # <20201229104919_FOOBAR0300040682>.zip.done
        # Separate iterator, 2 numbers
        # (10# forces base 10, otherwise 08 and 09 fail as invalid octal)
        iter=$(( 10#${basenameNoExt:${#basenameNoExt}-10:2} ))
        # 20201229104919_FOOBAR<03>00040682.zip.done
        if [[ $iter -gt $max ]]
        then
            max=$iter
            if [[ -n $mostRecent ]]
            then
                rm $mostRecent*
                cleanedOld=1
            fi
            mostRecent=$basenameNoExt
        elif [[ $iter -lt $max ]]
        then
            [ -f $file ] && rm $basenameNoExt* 
            cleanedOld=1
        fi
        # $iter == $max -> same asset with different file extension, leave to be
    done

    if [[ $max -gt 0  && cleanedOld -gt 0 ]] 
    then
        assetsRemoved[$assetId]=$max
    fi

done

for a in "${!assetsRemoved[@]}"; do
    echo "Cleaned asset $a from versions lower than ${assetsRemoved[$a]}"
done

The problem

This solution has a serious problem: it is slow. The outer find lists every file; for each of them the script runs another find to work out the maximum version and delete the older versions. Later iterations of the outer loop then repeat the same find-and-remove work for assets that may already have been handled or deleted.

The question

Is there a way to execute commands for each result of find before all the results have been gathered? Or is there some other, more efficient way to loop over the results? There are over 100k files to handle, and I assume that each wildcard rm scans all of them when looking for the files to delete. That would mean more than 100,000^2 iterations over the files. Is there any way to prevent this?

Example

Consider a folder with the following files:

20191229104919_FOOBAR0001234567.zip
20191229104919_FOOBAR0001234567.zip.done
20191229104919_FOOBAR0001234567.zip.somethingelse
20191229104919_FOOBAR0087654321.zip
20191129104919_FOOBAR0087654321.zip.done
20191129104919_FOOBAR0087654321.zip.somethingelse
20191129100000_FOOBAR0187654321.zip
20191229100000_FOOBAR0187654321.zip.done
20191229100000_FOOBAR0187654321.zip.somethingelse
20201229104919_FOOBAR0101234567.zip
20201229104919_FOOBAR0101234567.zip.done
20201229104919_FOOBAR0101234567.zip.somethingelse
20211229104919_FOOBAR0201234567.zip
20211229104919_FOOBAR0201234567.zip.done
20211229104919_FOOBAR0201234567.zip.somethingelse
20221229104919_FOOBAR0201234567.zip
20221229104919_FOOBAR0201234567.zip.done
20221229104919_FOOBAR0201234567.zip.somethingelse

The remaining files after cleaning would be:

20191129100000_FOOBAR0187654321.zip
20191229100000_FOOBAR0187654321.zip.done
20191229100000_FOOBAR0187654321.zip.somethingelse
20211229104919_FOOBAR0201234567.zip
20211229104919_FOOBAR0201234567.zip.done
20211229104919_FOOBAR0201234567.zip.somethingelse
20221229104919_FOOBAR0201234567.zip
20221229104919_FOOBAR0201234567.zip.done
20221229104919_FOOBAR0201234567.zip.somethingelse

Notice:
Only the newest version matters. Files with the same asset version and ID but different timestamps or extensions must all be preserved.
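
The expected result in the listing above can be derived from the file names alone, reading each name once instead of rescanning the folder for every file. The following is only a sketch of that idea in plain bash (4.0 or newer, for associative arrays). The function name list_outdated is hypothetical, the 8-digit asset ID is assumed from the examples, and nothing is deleted: the function just prints the names that would go.

```shell
#!/bin/bash
# Sketch: read file names on stdin and print those whose version is lower
# than the highest version seen for the same asset ID.
# Assumption: a 2-digit version and an 8-digit asset ID precede ".zip".
list_outdated() {
    local -A maxVer=()                  # asset ID -> highest version seen
    local -a names=() ids=() vers=()
    local re='([0-9]{2})([0-9]{8})\.zip'
    local name id ver i

    # Pass 1: remember every name and the highest version per asset ID.
    while IFS= read -r name; do
        [[ $name =~ $re ]] || continue  # skip names outside the scheme
        ver=${BASH_REMATCH[1]}
        id=${BASH_REMATCH[2]}
        names+=("$name"); ids+=("$id"); vers+=("$ver")
        # 10# forces base 10 so that 08 and 09 are not treated as octal.
        if [[ -z ${maxVer[$id]+set} ]] || (( 10#$ver > 10#${maxVer[$id]} )); then
            maxVer[$id]=$ver
        fi
    done

    # Pass 2: print every name whose version is below that maximum.
    for i in "${!names[@]}"; do
        id=${ids[i]}
        if (( 10#${vers[i]} < 10#${maxVer[$id]} )); then
            printf '%s\n' "${names[i]}"
        fi
    done
    return 0
}
```

Fed with the 18 names from the example, for instance via printf '%s\n' /home/user/FOOBAR/*.zip* | list_outdated, this should print the nine names that are missing from the "remaining files" listing. Both passes touch each name once, so the work grows linearly with the number of files.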


Solution

  • Thanks to @ogus for the working and clean solution!
    Below is the final script I ended up using, added for the sake of documentation and to clarify the use of xargs in this case.

    #!/bin/bash
    
    # Takes optional argument to delete found assets while running.
    removeFound=${1:-n}
    location="/home/user/bashtest"
    
    if [[ "$removeFound" =~ ^(y|Y|yes|Yes|YES)$ ]]
    then
        echo "Deleting older assets from $location"
    else
        echo "Searching old assets from $location"
    fi
    
    
    # List all .zip and .zip.somethingelse files, pipe the names to awk, save awk's output to a variable
    assetsToDelete=`printf '%s\n' $location/*.zip* | awk '{
        # <timestamp>_FOOBAR<iterator><assetId><file-extensions>
        # 20191229104919_FOOBAR0387654321.zip.done
    
        # Extension position
        extPos = index($0, ".zip")
        # 20191229104919_FOOBAR0387654321<^>.zip.done
        
        # Separate asset ID
        assetId = substr($0, extPos - 8, 8)
        # 20191229104919_FOOBAR03<87654321>.zip.done
        
        # Separate iterator, 2 numbers
        assetVer = substr($0, extPos - 10, 2)
        # 20191229104919_FOOBAR<03>87654321.zip.done
    
        # List variables used below:
        # assetList -> [assetId][asset file(s)] -> keys: list of asset IDs encountered, values: one or more asset file paths, absolute, separated by ORS (newline)
        # maxAssetV -> [assetId][assetMaxVersion] -> keys: list of asset IDs encountered, values: maximum version of the corresponding asset encountered
    
        # Everything printed out with <print> is the output of the awk-command, thus to be deleted
    
        # ID has not been recorded yet, or this version is newer than the recorded one
        if (!(assetId in assetList) || assetVer > maxAssetV[assetId]) {
            # Asset already recorded with a smaller version: print the old path(s) so they get deleted
            if (assetId in assetList)
                print assetList[assetId]
    
            # Record new or newer asset
            assetList[assetId] = $0
            maxAssetV[assetId] = assetVer
        }
        # Find if asset is the same version as current max version
        else if (assetVer == maxAssetV[assetId]) {
            # Record the asset by stacking it on the list, separated with ORS (newline)
            assetList[assetId] = assetList[assetId] ORS $0
        }
        # Asset recorded and with smaller version -> print thus delete
        else {
            print
        }
    }' `
    
    if [ -z "$assetsToDelete" ]; then
        echo "Zero older assets found in the $location"
    else
        if [[ "$removeFound" =~ ^(y|Y|yes|Yes|YES)$ ]]
        then
            printf '%s\n' "$assetsToDelete" | xargs -I {} sh -c 'echo "$1"; rm -- "$1"' _ {}
        else
            echo "Moving files to $location/remove, delete manually from there."
            echo "To delete on the go, run script with parameter <yes>"
    
            # Make sure the target folder exists before moving anything into it
            mkdir -p "$location/remove"
            printf '%s\n' "$assetsToDelete" | xargs -I {} sh -c 'echo "$1"; mv -- "$1" "$(dirname -- "$1")/remove/"' _ {}
        fi
    fi
    
    exit
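
A side note on the xargs step: the script above starts one sh process per file, which adds up with 100k files. If the per-file echo is not needed, xargs can pack many paths into each rm call instead. This is a sketch under assumptions, not part of the script above: the function name remove_listed is made up, and the paths must contain no whitespace, quotes or backslashes (which holds for the strict naming scheme in the question).

```shell
#!/bin/bash
# Sketch: delete a newline-separated list of paths with as few rm calls
# as possible.
# Assumption: no whitespace, quotes or backslashes in the paths, because
# plain xargs would split or unquote such names.
remove_listed() {
    local list=$1
    [ -n "$list" ] || return 0      # nothing to delete
    # xargs fits as many paths as possible onto each rm command line,
    # so the argument-length limit is respected automatically.
    printf '%s\n' "$list" | xargs rm -f --
}
```

It would be called as remove_listed "$assetsToDelete" in place of the per-file pipeline. The trade-off is that the removed paths are no longer echoed one by one.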