The answer that worked for me and provided the most flexibility is by @M.NejatAydin:
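#!/bin/bash
# Usage: <script> FQPATH OUTPATH
FQPATH=$1
OUTPATH=$2
# Clear old links; ${OUTPATH:?} aborts if OUTPATH is unset or empty
rm -f -- "${OUTPATH:?}"/*
for src in "$FQPATH"/[^0-9]*.fastq.gz; do
    FILENAME=${src##*/}     # strip the directory part
    dst=${FILENAME#*_}      # strip the leading identifier
    # Increment the S number until the name is free in OUTPATH
    while [[ -e "$OUTPATH/$dst" ]]; do
        n=${dst#*_S}
        n=$(( ${n%%_*} + 1 ))
        dst=${dst%%S*}S${n}_${dst#*_*_}
    done
    # Write to the same directory the loop above checks
    echo "cp -s $src $OUTPATH/$dst"
    cp -s "$src" "$OUTPATH/$dst"
done
echo 'END'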
What I wanted
I have the following filenames in a folder:
A006200089_124769_S1_L001_R1_001.fastq.gz
A006200089_124769_S1_L001_R2_001.fastq.gz
A006200089_124771_S2_L001_R1_001.fastq.gz
A006200089_124771_S2_L001_R2_001.fastq.gz
A006850080_124769_S1_L001_R1_001.fastq.gz
A006850080_124769_S1_L001_R2_001.fastq.gz
A006850080_124769_S1_L002_R1_001.fastq.gz
A006850080_124769_S1_L002_R2_001.fastq.gz
A006850080_124771_S2_L001_R1_001.fastq.gz
A006850080_124771_S2_L001_R2_001.fastq.gz
A006850080_124771_S2_L002_R1_001.fastq.gz
A006850080_124771_S2_L002_R2_001.fastq.gz
The filenames follow this pattern, with the fields separated by _:
identifier_sampleName(integer)_S[1-100]_L00[1-2]_R[1-3]_001.fastq.gz
In a later step the $identifier will be removed, so that a filename is trimmed to, for example:
124771_S2_L002_R2_001.fastq.gz
The problem is that some of these files can end up with identical names after trimming:
A006200089_124769_S1_L001_R1_001.fastq.gz --> 124769_S1_L001_R1_001.fastq.gz
A006850080_124769_S1_L001_R1_001.fastq.gz --> 124769_S1_L001_R1_001.fastq.gz
What I want is:
A006200089_124769_S1_L001_R1_001.fastq.gz --> 124769_S1_L001_R1_001.fastq.gz
A006200089_124769_S1_L001_R2_001.fastq.gz --> 124769_S1_L001_R2_001.fastq.gz
A006850080_124769_S1_L001_R1_001.fastq.gz --> 124769_S2_L001_R1_001.fastq.gz
A006850080_124769_S1_L001_R2_001.fastq.gz --> 124769_S2_L001_R2_001.fastq.gz
A006850080_124769_S1_L002_R1_001.fastq.gz --> 124769_S2_L002_R1_001.fastq.gz
A006850080_124769_S1_L002_R2_001.fastq.gz --> 124769_S2_L002_R2_001.fastq.gz
When there are just a few samples I am using the following code:
#!/bin/bash -l
for i in "$1"/A006850080*.fastq.gz
do
    DIR=${i%/*}
    base1=${i##*/}
    NOEXT=${base1%.*}       # strip .gz
    NOEXT1=${NOEXT%.*}      # strip .fastq
    # Split the name into its six underscore-separated fields
    IFS=_ read -r A B C D E F <<< "$NOEXT1"
    SNUM=${C:1}             # S number without the leading "S"
    NUM=$((SNUM+1))
    mv "$DIR/$base1" "$DIR/${A}_${B}_S${NUM}_${D}_${E}_${F}.fastq.gz"
done
In the line NUM=$((SNUM+1)) I have counted the occurrences of the A006200089_124769* filenames and increased the S[1-100] part by that number.
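The counting step described above could also be automated. The following is only a minimal sketch: count_collisions is a hypothetical helper name, and it assumes that every untrimmed file starts with a non-numeric identifier followed by _ (as in the examples). It prints, for each trimmed name, how many files would map to it.

```bash
#!/bin/bash
# count_collisions DIR (hypothetical helper, not part of the original scripts):
# for each name that remains after dropping the identifier,
# print how many files in DIR would end up with that name.
count_collisions() {
    local dir=$1 f
    for f in "$dir"/[^0-9]*.fastq.gz; do
        [[ -e $f ]] || continue         # the glob matched nothing
        f=${f##*/}                      # strip the directory part
        printf '%s\n' "${f#*_}"         # strip the identifier
    done | sort | uniq -c
}
```

Any trimmed name with a count above 1 is one that would be overwritten by a plain rename.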
This code is not enough if there are more occurrences of the same sample name:
A006850069_124769_S1_L001_R1_001.fastq.gz
A006850075_124769_S1_L001_R1_001.fastq.gz
A006200089_124769_S1_L001_R1_001.fastq.gz
A006850080_124769_S1_L001_R1_001.fastq.gz
or if there are more $sampleName values (possibly in the hundreds).
Is there a way to parse all files of the same $sampleName and change the S[1-100] part so that no files will be overwritten?
Thank you in advance
Here is an implementation in plain bash:
$ cat /tmp/rename
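#!/bin/bash
cd "$1" || exit
# Only files whose names still start with the (non-numeric) identifier
for src in [^0-9]*.fastq.gz; do
    dst=${src#*_}           # drop the identifier
    # Increment the S number until no file with that name exists
    while [[ -e $dst ]]; do
        n=${dst#*_S}
        n=$(( ${n%%_*} + 1 ))
        dst=${dst%%S*}S${n}_${dst#*_*_}
    done
    mv ./"$src" ./"$dst"
done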
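To make the parameter expansions inside the while loop easier to follow, here is the same logic pulled out into a function and traced on a single name. This is only an illustration: bump_s and the variables head and tail are names introduced here, not part of the script above.

```bash
#!/bin/bash
# bump_s NAME (hypothetical helper): print NAME with its S number
# increased by one, using the same expansions as the loop body.
bump_s() {
    local dst=$1 n head tail
    n=${dst#*_S}            # drop everything through the first "_S"
    n=$(( ${n%%_*} + 1 ))   # keep the digits before the next "_", add one
    head=${dst%%S*}         # everything before the first "S"
    tail=${dst#*_*_}        # everything after the second "_"
    printf '%s\n' "${head}S${n}_${tail}"
}

src=A006850080_124769_S1_L001_R1_001.fastq.gz
dst=${src#*_}               # 124769_S1_L001_R1_001.fastq.gz
bump_s "$dst"               # 124769_S2_L001_R1_001.fastq.gz
```

Note that these expansions rely on the sample name containing no uppercase S, which holds for the purely numeric sample names in the question.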
Test:
$ mkdir /tmp/test
$ cd /tmp/test
$ touch A00620008{0,9}_124769_S1_L00{1,2}_R{1,2}_001.fastq.gz
$ ls -1
A006200080_124769_S1_L001_R1_001.fastq.gz
A006200080_124769_S1_L001_R2_001.fastq.gz
A006200080_124769_S1_L002_R1_001.fastq.gz
A006200080_124769_S1_L002_R2_001.fastq.gz
A006200089_124769_S1_L001_R1_001.fastq.gz
A006200089_124769_S1_L001_R2_001.fastq.gz
A006200089_124769_S1_L002_R1_001.fastq.gz
A006200089_124769_S1_L002_R2_001.fastq.gz
$ /tmp/rename /tmp/test
$ ls -1
124769_S1_L001_R1_001.fastq.gz
124769_S1_L001_R2_001.fastq.gz
124769_S1_L002_R1_001.fastq.gz
124769_S1_L002_R2_001.fastq.gz
124769_S2_L001_R1_001.fastq.gz
124769_S2_L001_R2_001.fastq.gz
124769_S2_L002_R1_001.fastq.gz
124769_S2_L002_R2_001.fastq.gz
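Since mv is destructive, it may be worth previewing the mapping first. Below is a minimal dry-run sketch of the same approach: plan_renames is a hypothetical function name, and it assumes bash 4 or later for the associative array. Because nothing is actually renamed, the names already planned are tracked in the array instead of being checked only on disk.

```bash
#!/bin/bash
# plan_renames DIR (hypothetical helper): print "old -> new" for every
# file the rename script would touch, without renaming anything.
plan_renames() {
    local dir=$1 src name dst n
    local -A taken=()                   # names already planned (bash 4+)
    for src in "$dir"/[^0-9]*.fastq.gz; do
        [[ -e $src ]] || continue       # the glob matched nothing
        name=${src##*/}
        dst=${name#*_}                  # drop the identifier
        # A name is unavailable if it is planned already or exists on disk
        while [[ -n ${taken[$dst]} || -e $dir/$dst ]]; do
            n=${dst#*_S}
            n=$(( ${n%%_*} + 1 ))
            dst=${dst%%S*}S${n}_${dst#*_*_}
        done
        taken[$dst]=1
        printf '%s -> %s\n' "$name" "$dst"
    done
}
```

Running plan_renames on a directory prints one line per file, so the output can be reviewed (or compared with the expected mapping) before the real script is run.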