
A bash script for replacing patterns in multiple file names based on a 2-column mapping file


I have a bunch of files with mixed IDs in a directory (Linux environment). They look like this:

SRR7821874_1.fastq.gz
SRR7821874_2.fastq.gz
SRR7821870_1.fastq.gz
SRR7821870_2.fastq.gz

I also have a 2-column tab-delimited file (called rename.tsv) that maps each old ID to its replacement:

Read       Sample      
SRR7821874 GSM3385663 
SRR7821870 GSM3385659  

In addition, I would like to change _1 to _S1_L001_R1_001 and _2 to _S1_L001_R2_001 in the file names at the same time, so the final result should look like this:

SRR7821874_1.fastq.gz --> GSM3385663_S1_L001_R1_001.fastq.gz
SRR7821874_2.fastq.gz --> GSM3385663_S1_L001_R2_001.fastq.gz
SRR7821870_1.fastq.gz --> GSM3385659_S1_L001_R1_001.fastq.gz
SRR7821870_2.fastq.gz --> GSM3385659_S1_L001_R2_001.fastq.gz   

I tried the following script (for the ID replacement part only) without success, apparently because mv needs the full file names rather than just the IDs:

while read -r Read Sample; do mv "$Read" "$Sample"; done < rename.tsv

Solution

  • You can try:

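    # Skip the header line, then read each tab-separated pair of IDs.
    tail -n+2 rename.tsv | while IFS=$'\t' read -r from to; do
      shopt -s nullglob  # an unmatched glob expands to nothing
      for f in "${from}_"*.fastq.gz; do
        # Extract the read number: SRR7821874_1.fastq.gz -> 1
        num="${f##*_}"; num="${num%%.*}"
        mv "$f" "${to}_S1_L001_R${num}_001.fastq.gz"
      done
    done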
    

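    We use tail to skip the header line, and we enable the nullglob bash option so that "${from}_"*.fastq.gz expands to nothing, rather than to the pattern itself, when no file matches. Because the loop is part of a pipeline, bash runs it in a subshell by default, so the nullglob setting does not carry over to the calling shell once the loop ends.

    "${f##*_}" and "${num%%.*}" are two of the numerous bash parameter expansions. The first removes the longest prefix matching *_ (everything up to and including the last underscore); the second removes the longest suffix matching .* (everything from the first dot onward). Together they leave only the read number.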

    Note that you can use a more accurate pattern if needed. For instance, if you know that the number is always 1 or 2 you could replace "${from}_"*.fastq.gz with "${from}_"[12].fastq.gz. Or, if it is any one-digit number: "${from}_"[0-9].fastq.gz.
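
To try the loop without touching real data, it can be run on empty stand-ins for the sample files. The following is only a sketch under stated assumptions: the scratch directory created with mktemp (held in the hypothetical variable demo_dir), the stricter [12] pattern, and the extra -- passed to mv are illustrative choices, not part of the original answer. Here nullglob is set once before the pipeline, so it stays enabled for the rest of the script.

```shell
#!/usr/bin/env bash
# Sketch: exercise the rename loop on throwaway copies of the sample data.

demo_dir="$(mktemp -d)"   # hypothetical scratch directory, not from the question
cd "$demo_dir" || exit 1

# Recreate the four sample file names as empty files.
touch SRR7821874_1.fastq.gz SRR7821874_2.fastq.gz \
      SRR7821870_1.fastq.gz SRR7821870_2.fastq.gz

# Recreate the tab-delimited mapping file, header included.
printf 'Read\tSample\n'           >  rename.tsv
printf 'SRR7821874\tGSM3385663\n' >> rename.tsv
printf 'SRR7821870\tGSM3385659\n' >> rename.tsv

shopt -s nullglob   # an unmatched glob expands to nothing

tail -n +2 rename.tsv | while IFS=$'\t' read -r from to; do
  for f in "${from}_"[12].fastq.gz; do
    num="${f##*_}"     # SRR7821874_1.fastq.gz -> 1.fastq.gz
    num="${num%%.*}"   # 1.fastq.gz            -> 1
    mv -- "$f" "${to}_S1_L001_R${num}_001.fastq.gz"
  done
done

ls   # list the scratch directory after renaming
```

If the sketch behaves as intended, ls should show the four GSM... names next to rename.tsv and no remaining SRR... files. Replacing mv with echo mv in the loop is a simple way to preview the renames before running them on the real directory.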