
How do I ignore a byte order mark (BOM) in a while read loop in zsh?


I need to verify that all images mentioned in a CSV are present inside a folder. I wrote a small shell script for that:

#!/bin/zsh
red='\033[0;31m'
color_Off='\033[0m'

csvfile=$1
imgpath=$2

cat $csvfile | while IFS=, read -r filename rurl
do
    if [ -f "${imgpath}/${filename}" ]
    then
        echo -n
    else
        echo -e "$filename ${red}MISSING${color_Off}"
    fi
done

My CSV looks something like this:

Image1.jpg,detail-1
Image2.jpg,detail-1
Image3.jpg,detail-1

The CSV was created by Excel.

All three images are present in imgpath, but for some reason my output says:

Image1.jpg MISSING

Running the script with zsh -x showed that my CSV file has a BOM at the very beginning, which turns the first image name into \ufeffImage1.jpg and causes the whole issue.

How can I ignore a BOM (byte order mark) in a while read operation?


Solution

  • zsh provides a parameter expansion (also available in POSIX shells) to remove a prefix: ${var#prefix} will expand to $var with prefix removed from the front of the string.

    Like ksh93 and bash, zsh also supports ANSI C-like string syntax: $'\ufeff' expands to the Unicode character used as a BOM (U+FEFF).

    Combining these, ${filename#$'\ufeff'} expands to the content of $filename with a leading BOM removed if one is present.
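
    As a minimal sketch of that prefix removal in isolation, the snippet below strips a BOM from a sample value. The helper name strip_bom and the sample filenames are illustrative assumptions, not part of the original script. The BOM is built here from its UTF-8 bytes with printf, so the snippet should also run in shells whose $'...' quoting does not understand \u escapes.

    ```shell
    # UTF-8 encoding of U+FEFF (the BOM), written as octal bytes.
    bom=$(printf '\357\273\277')

    # strip_bom is a hypothetical helper: print its first argument
    # with a leading BOM removed, if one is present.
    strip_bom() {
        printf '%s\n' "${1#"$bom"}"
    }

    with_bom="${bom}Image1.jpg"   # simulates the first field of an Excel-exported CSV

    strip_bom "$with_bom"     # prints Image1.jpg
    strip_bom "Image2.jpg"    # prints Image2.jpg (no BOM, so nothing is removed)
    ```

    Because # only removes a match at the very start of the value, lines after the first (which carry no BOM) pass through unchanged.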

    The below also makes some changes for better performance, more reliable behavior with odd filenames, and compatibility with non-zsh shells.

    #!/bin/zsh
    red='\033[0;31m'
    color_Off='\033[0m'
    
    csvfile=$1
    imgpath=$2
    
    while IFS=, read -r filename rurl; do
        filename=${filename#$'\ufeff'}
        if ! [ -f "${imgpath}/${filename}" ]; then
            printf '%s %bMISSING%b\n' "$filename" "$red" "$color_Off"
        fi
    done <"$csvfile"
    

    Notes on changes unrelated to the specific fix:

    • Replacing echo -e with printf lets us pick which specific arguments get escape sequences expanded: %s for the filename means backslashes and other escapes in it are printed unmodified, whereas %b for $red and $color_Off ensures that the color escape sequences are interpreted.
    • Replacing cat $csvfile | with <"$csvfile" avoids the overhead of starting a separate cat process and ensures that the while read loop runs in the same shell as the rest of the script rather than in a subshell. zsh already runs the last command of a pipeline in the current shell, so this matters less there, but it is a problem in bash unless the non-default lastpipe option is enabled.
    • echo -n isn't reliable as a no-op: some shells print -n as output, and POSIX permits this because it leaves the behavior of echo implementation-defined when the first argument is -n. If you need a no-op, : or true is a better choice; in this case, though, we can simply invert the test and move the else branch into the if branch.
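
    To illustrate the %s versus %b point from the notes above, here is a small sketch. The helper name report_missing and the sample filename are hypothetical; the filename deliberately contains a literal backslash followed by n, which echo -e could turn into a newline.

    ```shell
    red='\033[0;31m'
    color_off='\033[0m'

    # report_missing is a hypothetical helper that mirrors the printf call
    # used in the script: %s prints the filename verbatim, while %b
    # interprets the backslash escapes stored in the color variables.
    report_missing() {
        printf '%s %bMISSING%b\n' "$1" "$red" "$color_off"
    }

    # Hypothetical filename containing a literal backslash followed by "n".
    name='back\nslash.jpg'
    report_missing "$name"
    ```

    The filename should appear on a single line with its backslash intact, followed by MISSING in red on terminals that honor ANSI color codes.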