Tags: regex, bash, filenames, compressed-files

Bash regex that recognises file extensions does not handle both plain and compressed files


I created this little Bash script that has one argument (a filename) and the script is supposed to respond according to the extension of the file:

#!/bin/bash

fileFormat=${1}

if [[ ${fileFormat} =~ [Ff][Aa]?[Ss]?[Tt]?[Qq]\.?[[:alnum:]]+$ ]]; then
    echo "its a FASTQ file";
elif [[ ${fileFormat} =~ [Ss][Aa][Mm] ]]; then
    echo "its a SAM file";
else
    echo "its not fasta nor sam";
fi

It's run like this:

sh script.sh filename.sam

If it's a fastq (or FASTQ, or fq, or FQ, or fastq.gz (compressed)), I want the script to tell me "it's a fastq". If it's a sam, I want it to tell me it's a sam, and otherwise I want it to tell me it's neither sam nor fastq.

THE PROBLEM: before I handled the .gz (compressed) case, the script ran well and gave the result I expected, but something goes wrong once I add the last part of the pattern to account for it (see the first if condition, the part that says \.?[[:alnum:]]+ ). This part is meant to say "in the filename, after the extension (fastq in this case), there might be a dot followed by some word".

My input is this:

sh script.sh filename.fastq.gz

And it works. But if I put: sh script.sh filename.fastq

It says it's not a fastq. I wanted to make that last part optional, but adding a "?" at the end doesn't work. How can I fix that part so the script works for both cases? Thanks!


Solution

  • You may use this regex:

    fileFormat="$1"
    
    if [[ $fileFormat =~ [Ff]([Aa][Ss][Tt])?[Qq](\.[[:alnum:]]+)?$ ]]; then
        echo "its a FASTQ file"
    elif [[ $fileFormat =~ [Ss][Aa][Mm]$ ]]; then
        echo "its a SAM file"
    else
        echo "its not fasta nor sam"
    fi
    

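    Here (\.[[:alnum:]]+)? makes the last group, a dot followed by one or more alphanumeric characters, optional as a whole. In the original pattern, \.?[[:alnum:]]+ made only the dot optional, so at least one more character was still required after the q, and filename.fastq could not match. Grouping ([Aa][Ss][Tt])? also means only fq or fastq is accepted (not, say, fsq), and the trailing $ on the SAM pattern anchors it to the end of the name.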

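    When you run it (directly or with bash rather than sh, because [[ ... =~ ]] is a Bash feature that a POSIX sh may not support):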

    ./script.sh filename.fastq
    its a FASTQ file
    
    ./script.sh fq
    its a FASTQ file
    
    ./script.sh filename.fastq.gz
    its a FASTQ file
    
    ./script.sh filename.sam
    its a SAM file
    
    ./script.sh filename.txt
    its not fasta nor sam
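
As a possible variation on the answer above, here is a minimal sketch that wraps the check in a function and relies on Bash's nocasematch option instead of spelling out [Ff][Aa]... classes. The function name classify_file and the labels it prints are hypothetical, chosen only for illustration. It also assumes the extension must follow a literal dot, which is stricter than the regex above: a bare fq would no longer count as FASTQ.

```shell
#!/usr/bin/env bash

# classify_file is a hypothetical helper, not part of the original answer.
# The body runs in a subshell ( ... ) so that nocasematch does not leak
# into the caller's shell.
classify_file() (
    local name=$1
    # Keeping each regex in a variable avoids quoting surprises inside [[ ]].
    # Assumption: the extension must be preceded by a literal dot.
    local fastq_re='\.(fq|fastq)(\.[[:alnum:]]+)?$'
    local sam_re='\.sam$'

    # Make [[ ... =~ ... ]] match case-insensitively.
    shopt -s nocasematch

    if [[ $name =~ $fastq_re ]]; then
        echo "FASTQ"
    elif [[ $name =~ $sam_re ]]; then
        echo "SAM"
    else
        echo "other"
    fi
)

for f in reads.fastq reads.FQ.gz aln.sam notes.txt; do
    printf '%s -> %s\n' "$f" "$(classify_file "$f")"
done
# reads.fastq -> FASTQ
# reads.FQ.gz -> FASTQ
# aln.sam -> SAM
# notes.txt -> other
```

Whether to require the leading dot depends on how strict you want to be: with it, a name such as myfastq is reported as other; without it, the behaviour is closer to the regex in the answer.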