
Rename files giving an increased numeric value based on previous occurrences, using bash


The answer that worked for me and provided the most flexibility is by @M.NejatAydin:

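#!/bin/bash
# Usage: script FQPATH OUTPATH
# Symlinks every FQPATH/<identifier>_*.fastq.gz into OUTPATH without the
# identifier, bumping the S number until the target name is free.

FQPATH=$1
OUTPATH=$2
# Empty the output folder; ${OUTPATH:?} aborts if OUTPATH is unset or empty
rm -f -- "${OUTPATH:?}"/*
for src in "$FQPATH"/[^0-9]*.fastq.gz; do
        FILENAME=${src##*/}
        dst=${FILENAME#*_}
        # Check for collisions in the same folder the links are written to
        while [[ -e "$OUTPATH/$dst" ]]; do
                n=${dst#*_S}
                n=$(( ${n%%_*} + 1 ))
                dst=${dst%%S*}S${n}_${dst#*_*_}
        done
        echo "cp -s $src $OUTPATH/$dst"
        cp -s "$src" "$OUTPATH/$dst"
done
echo 'END'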

What I wanted

I have the following filenames in a folder:

A006200089_124769_S1_L001_R1_001.fastq.gz
A006200089_124769_S1_L001_R2_001.fastq.gz
A006200089_124771_S2_L001_R1_001.fastq.gz
A006200089_124771_S2_L001_R2_001.fastq.gz
A006850080_124769_S1_L001_R1_001.fastq.gz
A006850080_124769_S1_L001_R2_001.fastq.gz
A006850080_124769_S1_L002_R1_001.fastq.gz
A006850080_124769_S1_L002_R2_001.fastq.gz
A006850080_124771_S2_L001_R1_001.fastq.gz
A006850080_124771_S2_L001_R2_001.fastq.gz
A006850080_124771_S2_L002_R1_001.fastq.gz
A006850080_124771_S2_L002_R2_001.fastq.gz

The names follow this pattern:

identifier_sampleName(integer)_S[1-100]_L[lane]_R[1-3]_001.fastq.gz

with the fields separated by _.

In a following step the $identifier is removed and the filename is trimmed to:

124771_S2_L002_R2_001.fastq.gz

The problem is that some of those entries can end up with an identical filename:

A006200089_124769_S1_L001_R1_001.fastq.gz --> 124769_S1_L001_R1_001.fastq.gz
A006850080_124769_S1_L001_R1_001.fastq.gz --> 124769_S1_L001_R1_001.fastq.gz
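
To make the collision concrete, here is a minimal sketch that trims the identifier from the two names above and compares the results. The helper name trim_identifier is made up for this illustration; it only uses the ${var#*_} parameter expansion, which removes everything up to and including the first _.

```bash
#!/bin/bash
# Hypothetical helper: strip the leading identifier,
# i.e. everything up to and including the first "_".
trim_identifier() {
    printf '%s\n' "${1#*_}"
}

a=$(trim_identifier A006200089_124769_S1_L001_R1_001.fastq.gz)
b=$(trim_identifier A006850080_124769_S1_L001_R1_001.fastq.gz)

# Both source names reduce to the same trimmed name
if [[ $a == "$b" ]]; then
    echo "collision: both map to $a"
fi
```

Nothing is renamed here; the sketch only shows that two different source files compete for the same target name.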

What I want is

A006200089_124769_S1_L001_R1_001.fastq.gz --> 124769_S1_L001_R1_001.fastq.gz
A006200089_124769_S1_L001_R2_001.fastq.gz --> 124769_S1_L001_R2_001.fastq.gz
A006850080_124769_S1_L001_R1_001.fastq.gz --> 124769_S2_L001_R1_001.fastq.gz
A006850080_124769_S1_L001_R2_001.fastq.gz --> 124769_S2_L001_R2_001.fastq.gz
A006850080_124769_S1_L002_R1_001.fastq.gz --> 124769_S2_L002_R1_001.fastq.gz
A006850080_124769_S1_L002_R2_001.fastq.gz --> 124769_S2_L002_R2_001.fastq.gz

When there are just a few samples I am using the following code:

#!/bin/bash -l

for i in "$1"/A006850080*.fastq.gz
do
    DIR=${i%/*}
    base1=${i##*/}
    NOEXT1=${base1%.fastq.gz}    # strip both extensions

    # Split the name into its six _-separated fields
    IFS=_ read -r A B C D E F <<< "$NOEXT1"

    SNUM=${C:1}                  # "S1" -> 1
    NUM=$((SNUM+1))
    mv -- "$DIR/$base1" "$DIR/${A}_${B}_S${NUM}_${D}_${E}_${F}.fastq.gz"
done

In the line NUM=$((SNUM+1)) the increment is hard-coded: I counted the occurrences of the A006200089_124769* filename myself and increased the S[1-100] part by that number.

This code is not enough if:

  • there are more occurrences of the same sample:

      A006850069_124769_S1_L001_R1_001.fastq.gz
      A006850075_124769_S1_L001_R1_001.fastq.gz
      A006200089_124769_S1_L001_R1_001.fastq.gz
      A006850080_124769_S1_L001_R1_001.fastq.gz

  • there are more $sampleName values (could be in the range of 100s)
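
The counting of occurrences described above could in principle be done by the shell instead of by eye. The following is only a sketch, assuming bash 4 or later for associative arrays; the array names count and names are invented for the example, and the last entry of the list is taken from the folder listing further up to show a name without duplicates.

```bash
#!/bin/bash
# Count how many source names share each trimmed name (needs bash 4+).
declare -A count

# Example names from the question; in practice this could be a glob.
names=(
    A006850069_124769_S1_L001_R1_001.fastq.gz
    A006850075_124769_S1_L001_R1_001.fastq.gz
    A006200089_124769_S1_L001_R1_001.fastq.gz
    A006850080_124769_S1_L001_R1_001.fastq.gz
    A006850080_124771_S2_L001_R1_001.fastq.gz
)

for name in "${names[@]}"; do
    trimmed=${name#*_}      # drop the identifier
    count[$trimmed]=$(( ${count[$trimmed]:-0} + 1 ))
done

# Print each trimmed name with its number of occurrences
for trimmed in "${!count[@]}"; do
    printf '%s\t%d\n' "$trimmed" "${count[$trimmed]}"
done
```

Any trimmed name with a count above 1 would be overwritten by a plain rename. The order of the printed lines is not guaranteed, because associative arrays in bash are unordered.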

Is there a way to parse all files of the same $sampleName and change the S[1-100] part so that no files will be overwritten?

Thank you in advance


Solution

  • Here is an implementation in plain bash:

    $ cat /tmp/rename

    #!/bin/bash
    
    cd "$1" || exit
    
    for src in [^0-9]*.fastq.gz; do
        dst=${src#*_}
        while [[ -e $dst ]]; do
            n=${dst#*_S}
            n=$(( ${n%%_*} + 1 ))
            dst=${dst%%S*}S${n}_${dst#*_*_}
        done
        mv  ./"$src" ./"$dst"
    done
    

    Test:

    $ mkdir /tmp/test
    $ cd /tmp/test
    $ touch A00620008{0,9}_124769_S1_L00{1,2}_R{1,2}_001.fastq.gz
    $ ls -1
    A006200080_124769_S1_L001_R1_001.fastq.gz
    A006200080_124769_S1_L001_R2_001.fastq.gz
    A006200080_124769_S1_L002_R1_001.fastq.gz
    A006200080_124769_S1_L002_R2_001.fastq.gz
    A006200089_124769_S1_L001_R1_001.fastq.gz
    A006200089_124769_S1_L001_R2_001.fastq.gz
    A006200089_124769_S1_L002_R1_001.fastq.gz
    A006200089_124769_S1_L002_R2_001.fastq.gz
    $ /tmp/rename /tmp/test
    $ ls -1
    124769_S1_L001_R1_001.fastq.gz
    124769_S1_L001_R2_001.fastq.gz
    124769_S1_L002_R1_001.fastq.gz
    124769_S1_L002_R2_001.fastq.gz
    124769_S2_L001_R1_001.fastq.gz
    124769_S2_L001_R2_001.fastq.gz
    124769_S2_L002_R1_001.fastq.gz
    124769_S2_L002_R2_001.fastq.gz
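
For readers unfamiliar with the parameter expansions used in the loop, the following sketch traces one pass of the while-loop body on a single name. The function name bump_s is hypothetical and not part of the answer's script; the three expansions inside it are the same as in the script.

```bash
#!/bin/bash
# One iteration of the while-loop body, wrapped in a hypothetical function.
bump_s() {
    local dst=$1 n
    n=${dst#*_S}              # text after the first "_S", e.g. "1_L001_R1_001.fastq.gz"
    n=$(( ${n%%_*} + 1 ))     # keep what precedes the next "_" and add 1
    # prefix before the first "S" + new number + everything after the second "_"
    printf '%s\n' "${dst%%S*}S${n}_${dst#*_*_}"
}

src=A006200089_124769_S1_L001_R1_001.fastq.gz
dst=${src#*_}                 # drop the identifier
next=$(bump_s "$dst")
echo "$dst -> $next"
```

The script repeats this step until the resulting name does not exist yet. Note that the expansions rely on the sample name being an integer: a sample name containing an S would confuse ${dst%%S*}.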