I'm trying to get rid of old asset versions. The file naming is strictly as follows:
<timestamp>_<constant><version><assetID>.zip<.extra>
for example, 202201012359_FOOBAR0101234567.zip.done.
<timestamp> is the datetime when the file has been added to the folder.
<constant> does not change inside the folder being handled.
<version> is a two-digit number starting from 00, and describes the version of the asset with the given <assetID>.
<extra> is optional, so the extension can be .zip, .zip.done, or .zip.somethingelse.
However, an asset may have all three extensions, and each of them can exist multiple times with different timestamps. This means that an asset may have several files with the same ID and version number but different timestamps.
The goal is to find the latest version of each asset ID and remove the older versions. The version number is what matters, not the timestamp.
The assets are all located in one folder without subfolders.
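To make the naming scheme concrete, here is a minimal sketch that splits one file name into its parts with a bash regular expression. The function name parse_asset_name is hypothetical, and the sketch assumes that <constant> is FOOBAR, that <version> has two digits and that <assetID> has eight digits, as in the examples.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script: split one file name
# into timestamp, version, asset ID and the optional extra extension.
# Assumes <constant> is FOOBAR, a 2-digit version and an 8-digit asset ID.
parse_asset_name() {
    local name=${1##*/}    # drop any leading directory
    local re='^([0-9]+)_FOOBAR([0-9]{2})([0-9]{8})\.zip(\..*)?$'
    [[ $name =~ $re ]] || return 1
    timestamp=${BASH_REMATCH[1]}
    version=${BASH_REMATCH[2]}
    assetId=${BASH_REMATCH[3]}
    extra=${BASH_REMATCH[4]}    # empty when the name ends in plain .zip
}

if parse_asset_name "20201229104919_FOOBAR0300040682.zip.done"; then
    echo "timestamp=$timestamp version=$version assetId=$assetId extra=$extra"
fi
```

The function returns a non-zero status for names that do not follow the scheme, so unrelated files can be skipped instead of being sliced at the wrong offsets.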
So far, this is how the goal can be reached:
#!/bin/bash
location="/home/user/FOOBAR"
echo "Deleting older files..."
# Run the last stage of a pipeline in this shell, so assetsRemoved survives the while loop
shopt -s lastpipe
# Associative array collecting the asset IDs that were cleaned, for the summary
declare -A assetsRemoved
# The main loop which finds all the files in the folder
find "$location" -maxdepth 1 -type f -name "*.zip*" -a -name "*FOOBAR*" | while IFS= read -r line; do
    # <timestamp>_FOOBAR<iterator><assetId><file-extensions>
    # 20201229104919_FOOBAR0300040682.zip.done
    # Separate assetId
    rest=${line#*'.zip'}
    # .done
    pos=$(( ${#line} - ${#rest} - 4 ))
    # 20201229104919_FOOBAR0300040682<^>.zip.done
    assetId=${line:pos-8:8}
    # 20201229104919_FOOBAR03<00040682>.zip.done
    # Find all files with same assetId
    assets="$(find "$location" -maxdepth 1 -type f -name "*$assetId.zip*" -a -name "*FOOBAR*")"
    # Init loop variables
    max=-1
    mostRecent=""
    cleanedOld=0
    # Loop all files with same assetId
    for file in $assets
    do
        # Separate path without extensions
        basenameNoExt="${file%%.*}"
        # <20201229104919_FOOBAR0300040682>.zip.done
        # Separate iterator, 2 digits; 10# forces base 10 so that 08 and 09 are not read as octal
        iter=$(( 10#${basenameNoExt:${#basenameNoExt}-10:2} ))
        # 20201229104919_FOOBAR<03>00040682.zip.done
        if [[ $iter -gt $max ]]
        then
            max=$iter
            if [[ -n $mostRecent ]]
            then
                rm "$mostRecent"*
                cleanedOld=1
            fi
            mostRecent=$basenameNoExt
        elif [[ $iter -lt $max ]]
        then
            [ -f "$file" ] && rm "$basenameNoExt"*
            cleanedOld=1
        fi
        # $iter == $max -> same asset with different file extension, leave to be
    done
    if [[ $max -gt 0 && $cleanedOld -gt 0 ]]
    then
        assetsRemoved[$assetId]=$max
    fi
done
for a in "${!assetsRemoved[@]}"; do
    echo "Cleaned asset $a from versions lower than ${assetsRemoved[$a]}"
done
This solution has a serious problem: it is slow. It first finds all the files, then takes one and works out the maximum version while deleting the older versions. The next iteration of the outermost find loop then runs the same find-and-remove step again, often for assets that have already been handled or deleted.
Is there a way to execute commands for each result of find before all the results have been gathered? Or is there some other, more efficient way to loop over the results? There are over 100k files to handle, and I assume that a wildcard rm loops through all of them when searching for the relevant files to delete. This would require over 100,000^2 iterations through the files. Is there any way to prevent this?
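One direction that avoids the quadratic behaviour is to list the folder only once and keep the highest version per asset ID in memory. The following is a rough dry-run sketch of that idea in plain bash, not a tested replacement: list_old_assets is a hypothetical name, and it assumes bash 4 or newer, FOOBAR as the constant, a two-digit version and an eight-digit asset ID. It only prints the candidates and deletes nothing.

```shell
#!/bin/bash
# Sketch only: print (do not delete) every file whose version is lower than
# the highest version seen for the same asset ID.
# Assumes <constant> is FOOBAR, a 2-digit version and an 8-digit asset ID.
list_old_assets() {
    local dir=$1 f id ver cur max
    local -A maxVer=()
    local re='_FOOBAR([0-9]{2})([0-9]{8})\.zip'
    local files=("$dir"/*.zip*)    # the folder is listed exactly once

    # Pass 1: remember the highest version per asset ID
    for f in "${files[@]}"; do
        [[ ${f##*/} =~ $re ]] || continue
        ver=$(( 10#${BASH_REMATCH[1]} ))    # 10# avoids octal parsing of 08 and 09
        id=${BASH_REMATCH[2]}
        cur=${maxVer[$id]:--1}
        if (( ver > cur )); then
            maxVer[$id]=$ver
        fi
    done

    # Pass 2: print every file that is below the maximum of its asset ID
    for f in "${files[@]}"; do
        [[ ${f##*/} =~ $re ]] || continue
        ver=$(( 10#${BASH_REMATCH[1]} ))
        id=${BASH_REMATCH[2]}
        max=${maxVer[$id]}
        if (( ver < max )); then
            printf '%s\n' "$f"
        fi
    done
}

# Example call (prints the candidates, one path per line):
# list_old_assets /home/user/FOOBAR
```

Each file is looked at twice, so the work grows linearly with the number of files instead of quadratically. Files with the same ID and the same highest version are never printed, whatever their timestamp or extension.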
Consider a folder with the following files:
20191229104919_FOOBAR0001234567.zip
20191229104919_FOOBAR0001234567.zip.done
20191229104919_FOOBAR0001234567.zip.somethingelse
20191229104919_FOOBAR0087654321.zip
20191129104919_FOOBAR0087654321.zip.done
20191129104919_FOOBAR0087654321.zip.somethingelse
20191129100000_FOOBAR0187654321.zip
20191229100000_FOOBAR0187654321.zip.done
20191229100000_FOOBAR0187654321.zip.somethingelse
20201229104919_FOOBAR0101234567.zip
20201229104919_FOOBAR0101234567.zip.done
20201229104919_FOOBAR0101234567.zip.somethingelse
20211229104919_FOOBAR0201234567.zip
20211229104919_FOOBAR0201234567.zip.done
20211229104919_FOOBAR0201234567.zip.somethingelse
20221229104919_FOOBAR0201234567.zip
20221229104919_FOOBAR0201234567.zip.done
20221229104919_FOOBAR0201234567.zip.somethingelse
The remaining files after cleaning would be:
20191129100000_FOOBAR0187654321.zip
20191229100000_FOOBAR0187654321.zip.done
20191229100000_FOOBAR0187654321.zip.somethingelse
20211229104919_FOOBAR0201234567.zip
20211229104919_FOOBAR0201234567.zip.done
20211229104919_FOOBAR0201234567.zip.somethingelse
20221229104919_FOOBAR0201234567.zip
20221229104919_FOOBAR0201234567.zip.done
20221229104919_FOOBAR0201234567.zip.somethingelse
Note:
The newest version is what matters. Files with the same asset version and ID but different timestamps or extensions must all be preserved.
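To try a clean-up script without touching real assets, the sample folder above can be recreated with empty files. This is only a convenience sketch: make_sample_folder is a hypothetical name, and the target directory is whatever scratch location you pass in.

```shell
#!/bin/bash
# Hypothetical helper: create the 18 sample files from the listing above as
# empty files inside the given directory.
make_sample_folder() {
    local dir=$1
    mkdir -p "$dir" || return 1
    touch \
        "$dir"/20191229104919_FOOBAR0001234567.zip{,.done,.somethingelse} \
        "$dir"/20191229104919_FOOBAR0087654321.zip \
        "$dir"/20191129104919_FOOBAR0087654321.zip{.done,.somethingelse} \
        "$dir"/20191129100000_FOOBAR0187654321.zip \
        "$dir"/20191229100000_FOOBAR0187654321.zip{.done,.somethingelse} \
        "$dir"/20201229104919_FOOBAR0101234567.zip{,.done,.somethingelse} \
        "$dir"/20211229104919_FOOBAR0201234567.zip{,.done,.somethingelse} \
        "$dir"/20221229104919_FOOBAR0201234567.zip{,.done,.somethingelse}
}

# Example call, using a fresh temporary directory:
# make_sample_folder "$(mktemp -d)/FOOBAR"
```

After running a candidate script against such a copy, the remaining file names can be compared with the expected listing.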
Thanks to @ogus for the working and clean solution!
I'll add the final solution I ended up using, for the sake of documentation and to clarify the use of xargs in this case.
#!/bin/bash
# Takes optional argument to delete found assets while running.
removeFound=${1:-n}
location="/home/user/bashtest"
if [[ "$removeFound" =~ ^(y|Y|yes|Yes|YES)$ ]]
then
    echo "Deleting older assets from $location"
else
    echo "Searching old assets from $location"
fi
# List all .zip and .zip.somethingelse files, pipe the lines to awk, save the output to a variable
assetsToDelete=$(printf '%s\n' "$location"/*.zip* | awk '{
    # <timestamp>_FOOBAR<iterator><assetId><file-extensions>
    # 20191229104919_FOOBAR0387654321.zip.done
    # Extension position
    extPos = index($0, ".zip")
    # 20191229104919_FOOBAR0387654321<^>.zip.done
    # Separate asset ID
    assetId = substr($0, extPos - 8, 8)
    # 20191229104919_FOOBAR03<87654321>.zip.done
    # Separate iterator, 2 digits
    assetVer = substr($0, extPos - 10, 2)
    # 20191229104919_FOOBAR<03>87654321.zip.done
    # Variables used below:
    # assetList -> [assetId][asset file(s)] -> keys: asset IDs encountered, values: one or more absolute asset file paths, separated by ORS (newline)
    # maxAssetV -> [assetId][assetMaxVersion] -> keys: asset IDs encountered, values: maximum version encountered for that asset
    # Everything printed with <print> is the output of the awk command, thus to be deleted
    # True if the ID has not been recorded yet, or this version is greater than the recorded one
    if (!(assetId in assetList) || assetVer > maxAssetV[assetId]) {
        # Asset already recorded with a smaller version: remove the old file(s) by printing their paths
        if (assetId in assetList)
            print assetList[assetId]
        # Record new or newer asset
        assetList[assetId] = $0
        maxAssetV[assetId] = assetVer
    }
    # Asset has the same version as the current max version
    else if (assetVer == maxAssetV[assetId]) {
        # Record the asset by stacking it on the list, separated with ORS (newline)
        assetList[assetId] = assetList[assetId] ORS $0
    }
    # Asset recorded and this file has a smaller version -> print, thus delete
    else {
        print
    }
}')
if [ -z "$assetsToDelete" ]; then
    echo "Zero older assets found in $location"
else
    if [[ "$removeFound" =~ ^(y|Y|yes|Yes|YES)$ ]]
    then
        # One path per line; xargs hands each one to a small shell as $1, which reports and removes it
        printf '%s\n' "$assetsToDelete" | xargs -I {} sh -c 'echo "$1"; rm -- "$1"' sh {}
    else
        echo "Moving files to ./remove folder, delete manually from there."
        echo "To delete on the go, run script with parameter <yes>"
        # Make sure the target folder exists before moving anything into it
        mkdir -p "$location/remove"
        printf '%s\n' "$assetsToDelete" | xargs -I {} sh -c 'echo "$1"; mv -- "$1" "$(dirname "$1")/remove/"' sh {}
    fi
fi
exit
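To illustrate the xargs step in isolation, here is a small dry-run sketch. With -I {}, xargs starts the command once per input line and substitutes that line for {}. Passing the line to sh as a positional parameter ($1) is one common way to keep the path out of the script text itself. The name preview_moves is hypothetical, and the function only prints the mv commands it would run.

```shell
#!/bin/bash
# Dry-run sketch: for every path read from stdin (one per line), print the
# mv command that would move it into a "remove" folder next to it.
# Nothing is moved or deleted here.
preview_moves() {
    xargs -I {} sh -c 'echo "mv $1 $(dirname "$1")/remove/"' sh {}
}

printf '%s\n' \
    /home/user/bashtest/20191229104919_FOOBAR0001234567.zip \
    /home/user/bashtest/20191229104919_FOOBAR0001234567.zip.done |
    preview_moves
```

Checking the printed commands first is a cheap way to confirm that the list of paths is what you expect before switching to the real rm or mv.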