
Moving files after comparing filenames and recreating source directories


I'm learning shell scripting and am trying to stay as POSIX compliant as possible while keeping the code base reasonably readable. The goal is to read a list of files from directory A, find their matches in directory B, and recreate the relevant part of directory B's tree in directory C, where the files from directory A should be moved. All matched files should then be removed from directory B, and any directories in B left empty as a result should be removed as well.

All files in directory A are unique, and each will always have one or more matches in directory B and never a match in directory C, although the matching sub-directories may already exist in directory C. Extensions change because the files are processed separately, but the filenames otherwise match exactly. Filenames may contain spaces and periods and are not always the same length. There are two levels of sub-directories in the output and archive directories.

Here's what I've got so far. I'm getting stuck on writing the for-loop to do the dirty work. Trying not to step too far outside of find, printf, awk, grep, for, and if.

#!/bin/sh
execHome="intendedMachine"
baseDir="/home/library/projects"
folderNew="output"
folderOld="working"
folderArchive="archive"
workingTypes=("jpg", "svg", "bmp", "tiff", "psd")

$folderNew="$baseDir/$folderNew"
$folderOld="$baseDir/$folderOld"
folderArchive="$baseDir/$folderArchive"

if [ "$(uname -n)" = "$execHome" ]
then

  count=$(find $folderNew -type f |grep -v "DS_Store" |awk -F "/" '{print $NF}'|wc -l)

  printf "\nFound/processing %s files in the %s folder\n\n" "$count" "$folderNew"

  find $folderNew -type f |grep -v "DS_Store" |awk -F "/" '{print $NF}'

else
  printf "Executed from %s; Run from %s for proper execution.\n" "$(uname -n)" "$execHome"
fi

Example:

Directory A

/home/library/projects/output/projectOne 1.a.png
/home/library/projects/output/projectOne 1.b.png
/home/library/projects/output/projectOne 1.c.png
/home/library/projects/output/projectThree 3.m.png
/home/library/projects/output/projectThree 3.o.png
/home/library/projects/output/projectFour 4.t.png
/home/library/projects/output/projectFour 4.u.png

Directory B

/home/library/projects/working/House/2018 01/projectOne 1.a.jpg
/home/library/projects/working/House/2018 01/projectOne 1.a.svg
/home/library/projects/working/House/2018 01/projectOne 1.b.jpg
/home/library/projects/working/House/2018 01/projectOne 1.b.svg
/home/library/projects/working/House/2018 01/projectOne 1.c.jpg
/home/library/projects/working/House/2018 02/projectTwo 2.g.jpg
/home/library/projects/working/House/2018 02/projectTwo 2.g.svg
/home/library/projects/working/House/2018 02/projectTwo 2.h.jpg
/home/library/projects/working/House/2018 02/projectTwo 2.h.svg
/home/library/projects/working/House/2018 02/projectTwo 2.i.jpg
/home/library/projects/working/Car/2018 03/projectThree 3.m.jpg
/home/library/projects/working/Car/2018 03/projectThree 3.n.jpg
/home/library/projects/working/Car/2018 03/projectThree 3.o.jpg
/home/library/projects/working/Car/2018 03/projectThree 3.o.svg
/home/library/projects/working/Car/2018 04/projectFour 4.s.jpg
/home/library/projects/working/Car/2018 04/projectFour 4.t.jpg
/home/library/projects/working/Car/2018 04/projectFour 4.u.jpg

Directory C

/home/library/projects/archive/House/2018 01/projectOne 1.d.png
/home/library/projects/archive/House/2018 01/projectOne 1.e.png
/home/library/projects/archive/House/2018 01/projectOne 1.f.png
/home/library/projects/archive/Car/2018 03/projectThree 3.p.png
/home/library/projects/archive/Car/2018 03/projectThree 3.q.png
/home/library/projects/archive/Car/2018 03/projectThree 3.r.png

Desired outcome:

Directory A files have been moved to Directory C

/home/library/projects/output/

Directory B should have Directory A files removed and empty folders deleted

/home/library/projects/working/House/2018 02/projectTwo 2.g.jpg
/home/library/projects/working/House/2018 02/projectTwo 2.g.svg
/home/library/projects/working/House/2018 02/projectTwo 2.h.jpg
/home/library/projects/working/House/2018 02/projectTwo 2.h.svg
/home/library/projects/working/House/2018 02/projectTwo 2.i.jpg
/home/library/projects/working/Car/2018 03/projectThree 3.n.jpg
/home/library/projects/working/Car/2018 04/projectFour 4.s.jpg

Directory C should contain both old archives and new output files as archives

/home/library/projects/archive/House/2018 01/projectOne 1.a.png
/home/library/projects/archive/House/2018 01/projectOne 1.b.png
/home/library/projects/archive/House/2018 01/projectOne 1.c.png
/home/library/projects/archive/House/2018 01/projectOne 1.d.png
/home/library/projects/archive/House/2018 01/projectOne 1.e.png
/home/library/projects/archive/House/2018 01/projectOne 1.f.png
/home/library/projects/archive/Car/2018 03/projectThree 3.m.png
/home/library/projects/archive/Car/2018 03/projectThree 3.o.png
/home/library/projects/archive/Car/2018 03/projectThree 3.p.png
/home/library/projects/archive/Car/2018 03/projectThree 3.q.png
/home/library/projects/archive/Car/2018 03/projectThree 3.r.png
/home/library/projects/archive/Car/2018 04/projectFour 4.t.png
/home/library/projects/archive/Car/2018 04/projectFour 4.u.png

I ran the code anyway on a machine with bash 4.4.19 to see how it does, but it didn't work quite as I expected. Here's the resulting output:

Found/processing 4 files in the /home/library/projects/output folder

./auto-archive.sh: line 34: hash["$proj"]: bad array subscript
parent of /home/library/projects/output/.temp/projectThree 3.m.png not found
parent of /home/library/projects/output/projectOne 1.a.png not found
parent of /home/library/projects/output/.temp/projectThree 3.0.png not found
parent of /home/library/projects/output/projectFour 4.t.png not found

My apologies. I also didn't mention earlier that Directory B should not be scanned recursively; in my use case a recursive scan picks up other temporary files that are still being written and may not yet be ready to move. Also, for the purposes of testing, only the four files listed above were actually in Directory A, not all the files listed initially. Further, after I recreated the proposed test structure, your code seems to execute flawlessly, which does not match the results from my actual file structure. I fear I may have missed some crucial element in describing my actual file structure or naming convention, and I am reviewing it now for differences. Sorry to be taking up your time, but I'm certainly impressed with your accuracy. It feels like we're getting close, but I definitely need this to run on an earlier version of bash.


Solution

  • The task can be divided into three steps:

    1. Create a map that associates each filename (project name) with its parent directory name in C. This is done as a preparation stage by analyzing the pathnames in B. It uses an associative array, so the bash version must be 4.2 or newer.

    2. Loop over the files in A, compose the destination pathname in C using the map created in the first step, move each file there, and remove the matching files in B.

    3. As a clean-up stage, remove empty directories in B, if any.

    Then how about:

    #!/bin/bash
    
    execHome="intendedMachine"
    baseDir="/home/library/projects"
    folderNew="output"
    folderOld="working"
    folderArchive="archive"
    workingTypes=("jpg" "svg" "bmp" "tiff" "psd")
    declare -A hash
    
    folderNew="$baseDir/$folderNew"
    folderOld="$baseDir/$folderOld"
    folderArchive="$baseDir/$folderArchive"
    
    if [ "$(uname -n)" != "$execHome" ]; then
        printf "Executed from %s; Run from %s for proper execution.\n" "$(uname -n)" "$execHome"
        exit
    fi
    
    count=$(find "$folderNew" -type f |grep -v "DS_Store" |awk -F "/" '{print $NF}'|wc -l)
    printf "\nFound/processing %s files in the %s folder\n\n" "$count" "$folderNew"
    
    # determine parent directory name for each project name and create a map for them
    while IFS=  read -r -d $'\0' f; do
        proj="${f##*/}"                 # remove dirname
        proj="${proj%.*}"               # remove extension
        # skip dotfiles such as .DS_Store: they leave an empty key, which
        # bash rejects with "bad array subscript"
        [ -n "$proj" ] || continue
        parent="${f##*$baseDir/}"       # remove pathname until $baseDir
        parent="${parent#*/}"           # strip pathname one-level deeper
        parent="${parent%/*}"           # remove filename
        # now we're mapping "projectOne 1.a" => "House/2018 01" e.g.
        # echo "$proj" "=>" "$parent"   # just for debugging
        hash["$proj"]="$parent"
    done < <(find "$folderOld" -type f -print0) # directory B
    
    # iterate over files in A; move to archive directory C and remove files in B
    while IFS=  read -r -d $'\0' f; do
        proj="${f##*/}"
        proj="${proj%.*}"
        [ -n "$proj" ] || continue      # skip dotfiles such as .DS_Store (empty project name)
        parent="${hash[$proj]}"
        if [[ "$parent" = "" ]]; then
            echo "parent of $f not found"   # may not occur but just in case ..
        else
            # move from A to C
            destdir="$folderArchive/$parent"
            mkdir -p -- "$destdir"
            mv -- "$f" "$destdir"
    
            # remove relevant file(s) in B
            for ext in "${workingTypes[@]}"; do
                oldfile="$folderOld/$parent/$proj.${ext}"
                [ -f "$oldfile" ] && rm -f -- "$oldfile"
            done
        fi
    done < <(find "$folderNew" -type f -print0) # directory A
    
    # clean-up: remove empty dirs in B
    find "$folderOld" -type d -empty -print0 | xargs -r -0 rmdir --
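
The error log in the question shows files under output/.temp being picked up, which suggests a recursive find may be collecting more than intended. As a hedged sketch only, and assuming that just the files sitting directly in a directory should be processed, a plain glob loop avoids descending at all. The function name list_top_level_files is hypothetical and not part of the script above.

```shell
# Hypothetical helper: print the regular files directly inside "$1",
# one per line, without descending into sub-directories.
# Names starting with a dot (.DS_Store, .temp) are skipped because
# the * glob does not match them.
list_top_level_files() {
    for _f in "$1"/*; do
        # -f is false for directories and for the unexpanded pattern
        # that remains when the directory has no visible entries
        [ -f "$_f" ] && printf '%s\n' "$_f"
    done
    return 0
}
```

The output is newline-separated, so it can feed a `while IFS= read -r f` loop, with the same caveat as elsewhere: filenames containing a newline would be split.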
    

    Explanations:

    • Array elements are separated by whitespace, not commas; with commas, each comma becomes part of the element.
    • Do not put $ before the variable name on the left-hand side of an assignment.
    • The while IFS= ... done < <(find ...) syntax is an idiom to loop over the output of find.
    • The ${parameter#word} type of syntax is a parameter expansion, used here to extract a substring from the path.
    • The associative array hash maps each project name, such as "projectOne 1.a", to its parent directory name, such as "House/2018 01".
    • The -- in some commands marks the end of options, which protects against filenames that start with -. (This protection may look paranoid...)
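
To make the parameter expansions concrete, here is a small trace on one of the sample paths from the question. It is only an illustration of the expansions used above; the comments show the value after each step.

```shell
baseDir="/home/library/projects"
f="$baseDir/working/House/2018 01/projectOne 1.a.jpg"

proj="${f##*/}"             # drop longest prefix ending in /  -> "projectOne 1.a.jpg"
proj="${proj%.*}"           # drop shortest suffix starting with .  -> "projectOne 1.a"

parent="${f##*$baseDir/}"   # -> "working/House/2018 01/projectOne 1.a.jpg"
parent="${parent#*/}"       # drop first path component  -> "House/2018 01/projectOne 1.a.jpg"
parent="${parent%/*}"       # drop the filename  -> "House/2018 01"

printf '%s => %s\n' "$proj" "$parent"   # prints: projectOne 1.a => House/2018 01
```

Note that `%` removes the shortest matching suffix, which is why only the final extension is stripped and the periods inside "projectOne 1.a" survive.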

    If your bash is older than 4.2, let me know. Then we need to find an alternative.
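
If it helps, the script could check the version itself before relying on declare -A. The following is a minimal sketch under assumptions: version_at_least is a hypothetical helper (not a bash built-in), and the 4.2 threshold is simply the one quoted above.

```shell
# Hypothetical helper: succeed if a "major.minor..." version string ($1)
# is at least major $2 and minor $3.
version_at_least() {
    _major=${1%%.*}       # text before the first dot
    _rest=${1#*.}         # text after the first dot
    _minor=${_rest%%.*}   # text before the next dot
    # refuse anything that is not purely numeric (e.g. an empty string)
    for _part in "$_major" "$_minor"; do
        case $_part in
            ''|*[!0-9]*) return 1 ;;
        esac
    done
    [ "$_major" -gt "$2" ] ||
        { [ "$_major" -eq "$2" ] && [ "$_minor" -ge "$3" ]; }
}

# BASH_VERSION is empty when the script runs under a shell other than bash
if version_at_least "${BASH_VERSION:-}" 4 2; then
    echo "associative arrays available"
else
    echo "falling back to the POSIX version"
fi
```

The helper itself uses only POSIX constructs, so the check can run before the script commits to any bash-only syntax.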

    EDIT
    Here's a POSIX-compliant version as an alternative.
    (Note that the script does not work if a filename contains a newline or the escape character \x1b, because these are used as separators in the lookup table.)

    #!/bin/sh
    
    execHome="intendedMachine"
    baseDir="/home/library/projects"
    folderNew="output"
    folderOld="working"
    folderArchive="archive"
    workingTypes="jpg
    svg
    bmp
    tiff
    psd"
    
    folderNew="$baseDir/$folderNew"
    folderOld="$baseDir/$folderOld"
    folderArchive="$baseDir/$folderArchive"
    nl="
    "                   # set to newline character
    esc=$(printf '\033')    # set to escape character (printf is portable; echo -ne is not)
    #esc=":"            # if \033 does not work well, try another character
    
    # substitute of reading a hash
    # it relies on the context that IFS is set to $nl
    # it is always called as $(read_lut ...), i.e. in a subshell, so its
    # variables cannot leak and "local" (not specified by POSIX) is not needed
    read_lut() {
        ret=""
        for i in $lut; do
            key="${i%${esc}*}"
            val="${i#*${esc}}"
            if [ "$key" = "$1" ]; then
                # loop until the end and use the last value
                ret="$val"
            fi
        done
        printf '%s\n' "$ret"
    }
    
    # substitute of writing to a hash
    write_lut() {
        lut=$(printf "%s\n%s%c%s" "$lut" "$1" "$esc" "$2")
    }
    
    if [ "$(uname -n)" != "$execHome" ]; then
        printf "Executed from %s; Run from %s for proper execution.\n" "$(uname -n)" "$execHome"
        exit
    fi
    
    count=$(find "$folderNew" -type f |grep -v "DS_Store" |awk -F "/" '{print $NF}'|wc -l)
    printf "\nFound/processing %s files in the %s folder\n\n" "$count" "$folderNew"
    
    # determine parent directory name for each project name and create a map for them
    ifs_bak="$IFS"
    IFS="$nl"
    for f in $(find "$folderOld" -type f); do
        proj="${f##*/}"         # remove dirname
        proj="${proj%.*}"               # remove extension
        parent="${f##*$baseDir/}"       # remove pathname until $baseDir
        parent="${parent#*/}"   # strip pathname one-level deeper
        parent="${parent%/*}"   # remove filename
        # now we're mapping "projectOne 1.a" => "House/2018 01" e.g.
    #   echo "$proj" "=>" "$parent"     # just for debugging
        write_lut "$proj" "$parent"
    done
    
    # iterate over files in A; move to archive directory C and remove files in B
    for f in $(find "$folderNew" -type f); do
        proj="${f##*/}"
        proj="${proj%.*}"
        parent=$(read_lut "$proj")
        if [ "$parent" = "" ]; then
            echo "parent of $f not found"   # may not occur but just in case ..
        else
            # move from A to C
            destdir="$folderArchive/$parent"
            mkdir -p -- "$destdir"
            mv -- "$f" "$destdir"
    
            # remove relevant file(s) in B
            for ext in $workingTypes; do
                oldfile="$folderOld/$parent/$proj.${ext}"
                [ -f "$oldfile" ] && rm -f -- "$oldfile"
            done
        fi
    done
    
    # clean-up: remove empty dirs in B
    find "$folderOld" -type d -empty -print0 | xargs -r -0 rmdir --
    
    # restore IFS
    IFS="$ifs_bak"
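
One caveat on the clean-up step: as far as I am aware, find's -empty and xargs' -r are GNU/BSD extensions rather than long-standing POSIX features, so they may be missing on a strictly POSIX system. A possible alternative, sketched here under that assumption, lets rmdir decide what is empty. The function name remove_empty_dirs is hypothetical.

```shell
# Hypothetical helper: remove every empty directory below "$1",
# but never "$1" itself.
remove_empty_dirs() {
    # -depth visits children before their parent, so a directory that
    # becomes empty once its sub-directories are gone is removed too.
    # rmdir refuses non-empty directories; those errors are discarded,
    # which unfortunately also hides any other rmdir error.
    find "$1" -depth -type d ! -path "$1" -exec rmdir -- {} \; 2>/dev/null
    return 0
}
```

It would be called as `remove_empty_dirs "$folderOld"`. Unlike the single-pass -empty version, this also removes parents that only become empty during the clean-up, which may or may not be what you want.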