
"basename" command won't include multiple files


I have a problem with the “basename” command. My host directory contains the fastq.gz files of two samples, named:

A29_WES_S3_R1_001.fastq.gz  
A29_WES_S3_R2_001.fastq.gz  
A30_WES_S1_R1_001.fastq.gz  
A30_WES_S1_R2_001.fastq.gz  

I need their basenames without the suffix, like this:

A29_WES_S3_R1_001  
A29_WES_S3_R2_001  
A30_WES_S1_R1_001  
A30_WES_S1_R2_001  

I used the following bash script:

#!/bin/bash
FILES1=(*R1_001.fastq.gz)
FILES2=(*R2_001.fastq.gz)
read1="${FILES1[@]}"
read2="${FILES2[@]}"
Ffile=$read1
Ffileprevix=$(basename "$Ffile" .fastq.gz)
Mfile=$read2
Mfileprevix=$(basename "$Mfile" .fastq.gz)
echo $Ffileprevix
echo $Mfileprevix
exit;

But every time I just get this output:

A29_WES_S3_R1_001.fastq.gz     A30_WES_S1_R1_001
A29_WES_S3_R2_001.fastq.gz     A30_WES_S1_R2_001

The suffix is stripped only from the last file (A30)!

I checked my script this way:

echo $read1
echo $read2

The result:

A29_WES_S3_R1_001.fastq.gz     A30_WES_S1_R1_001.fastq.gz
A29_WES_S3_R2_001.fastq.gz     A30_WES_S1_R2_001.fastq.gz

Then I did:

echo $Ffile
echo $Mfile

The result:

A29_WES_S3_R1_001.fastq.gz     A30_WES_S1_R1_001.fastq.gz
A29_WES_S3_R2_001.fastq.gz     A30_WES_S1_R2_001.fastq.gz

So $read1, $read2, $Ffile, and $Mfile work well.

Then I added “-a” to my basename command, since it accepts multiple files:

Ffileprevix=$(basename -a "$Ffile" .fastq.gz)
Mfileprevix=$(basename -a "$Mfile" .fastq.gz)

But it got worse! The result was:

A29_WES_S3_R1_001.fastq.gz     A30_WES_S1_R1_001.fastq.gz     .fastq.gz
A29_WES_S3_R2_001.fastq.gz     A30_WES_S1_R2_001.fastq.gz     .fastq.gz

Finally, I tried a “for ... do ...” loop around the basename command. Again, nothing changed!

Can anybody help me obtain what I want:
A29_WES_S3_R1_001
A29_WES_S3_R2_001
A30_WES_S1_R1_001
A30_WES_S1_R2_001


Solution

  • I'd leave basename out of this entirely, but that's purely personal preference. You could do something like:

    # Escape the dots and anchor at end of line so only real .fastq.gz names match
    FILES_PATTERN_1='R1_001\.fastq\.gz$'
    FILES_PATTERN_2='R2_001\.fastq\.gz$'
    
    # Get FILE PATTERN 1
    echo "Pattern 1:"
    for FILE in $(find . | grep "${FILES_PATTERN_1}" | cut -d. -f2 | tr -d /); do
      echo "$FILE"
    done
    
    # Get FILE PATTERN 2
    echo "Pattern 2:"
    for FILE in $(find . | grep "${FILES_PATTERN_2}" | cut -d. -f2 | tr -d /); do
      echo "$FILE"
    done
    

    The output should look like this (find does not return files in any guaranteed order):

    Pattern 1:
    A30_WES_S1_R1_001
    A29_WES_S3_R1_001
    Pattern 2:
    A29_WES_S3_R2_001
    A30_WES_S1_R2_001
    

    You could also use awk to do the parsing instead:

    # Get FILE PATTERN 1
    echo "Pattern 1:"
    for FILE in $(find . | grep "${FILES_PATTERN_1}" | awk -F '[/.]' '{print $3}'); do
      echo "$FILE"
    done
    

    There are a number of ways to approach this. If you had many more patterns to test, you could use functions to reduce code duplication.
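As a rough sketch of the function idea, the two near-identical loops could be folded into one helper. The name list_prefixes is made up for illustration, and the sort is added here only to make the order predictable; neither is part of the commands above.

```shell
#!/bin/bash
# Hypothetical helper (name invented for this sketch): print every file
# under the current directory whose path matches the given extended regex,
# with the directory part and the .fastq.gz suffix removed.
list_prefixes() {
  local pattern=$1 path name
  find . -type f | grep -E "$pattern" | sort | while IFS= read -r path; do
    name=${path##*/}                    # drop everything up to the last "/"
    printf '%s\n' "${name%.fastq.gz}"   # drop the .fastq.gz suffix
  done
}

echo "Pattern 1:"
list_prefixes 'R1_001\.fastq\.gz$'
echo "Pattern 2:"
list_prefixes 'R2_001\.fastq\.gz$'
```

Adding a third pattern then only takes one more call to the helper.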

    Also note that I'm doing this from a shell on Mac OS X, so on a Linux box some of these commands may need tweaking because tools such as find can produce different output (e.g. print $1 instead of print $3).
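One way to sidestep platform differences is to let bash strip the suffix itself, with no find, grep, or basename. It also helps to see why the script in the question misbehaves: read1="${FILES1[@]}" joins all array elements into one space-separated string, so basename receives a single argument and only trims the suffix at its very end. The sketch below assumes the four files from the question sit in the current directory and that the R1 and R2 lists line up sample by sample. The function name print_prefixes is invented for illustration.

```shell
#!/bin/bash
shopt -s nullglob   # an unmatched glob expands to nothing instead of itself

# Hypothetical helper: print the R1 and R2 prefixes pairwise for the
# files in the current directory.
print_prefixes() {
  local files1=(*R1_001.fastq.gz)
  local files2=(*R2_001.fastq.gz)
  local i
  # Globs expand in sorted order, so index i in both arrays is assumed
  # to belong to the same sample.
  for i in "${!files1[@]}"; do
    printf '%s\n' "${files1[i]%.fastq.gz}"   # ${var%suffix} removes a trailing match
    printf '%s\n' "${files2[i]%.fastq.gz}"
  done
}

print_prefixes
```

If you would rather keep basename, passing the array elements as separate arguments, along the lines of basename -a -s .fastq.gz "${FILES1[@]}", should also work where basename supports -a and -s (GNU coreutils and the BSD version on macOS do, as far as I know). With -a every argument is treated as a file name, which is why the attempt in the question printed .fastq.gz as an extra name.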