bash awk carriage-return

How to store file paths from a tab separated text file in a bash array


I have a tab-separated text file, e.g. table.txt, in which one column holds file paths:

> SampleID  Factor  Condition  Replicate  Treatment  Type  Dataset  isPE  ReadLength  isREF  PathFASTQ
> DG13      fd3     c1         1          cc         0     0102     0     50          1      "/path/to/fastq"
> DG14      fd3     c1         1          cc         1     0102     0     50          1      "/path/to/fastq"

I would like to store the paths in a bash array so that I can use them in a downstream parallel computation (SGE task arrays). For simplicity, the leading and trailing double quotes (") can be left out of table.txt.

Excluding the header line, I tried the following:

files=($(awk '{ if(($8 == 0)) { print $1} }' table.txt ))    
paths=($(awk '{ if(($8 == 0)) { print $11} }' table.txt ))
infile="${paths[$SGE_TASK_ID]}"/"${files[$SGE_TASK_ID]}".fastq.gz

In case it is unfamiliar: $SGE_TASK_ID is an integer between 1 and N, where the range is defined by the user.

Unfortunately, with SGE_TASK_ID=1, $infile does not contain the expected value:

/path/to/fastq/DG13.fastq.gz

Thanks for your help.


Solution

  • Could you please try the following? This code removes Control-M characters (carriage returns, \r) from each line while it runs.

    myarr=($(awk '{gsub(/\r/,"")} match($NF,/\/[^"]*/){\
             val=substr($NF,RSTART,RLENGTH);\
             num=split(val,array,"/");\
             print val"/"$1"."array[num]".gz"}'  Input_file))
    for i in "${myarr[@]}"
    do
      echo "$i"
    done
    

    If you would rather remove the Control-M characters from Input_file itself, run the following as well:

    tr -d '\r' < Input_file > temp && mv temp Input_file
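
    To confirm that carriage returns really are the problem, it can help to count the affected lines before and after the cleanup. This is only a minimal sketch: the temporary file and its two lines are made up for illustration, and the variable names (tmp, cr, cr_before, cr_after) are not from the question.

    ```bash
    # Hypothetical sample file with Windows (CRLF) line endings, for illustration only.
    tmp=$(mktemp)
    printf 'SampleID\tPathFASTQ\r\nDG13\t/path/to/fastq\r\n' > "$tmp"

    # A literal carriage return to search for.
    cr=$(printf '\r')

    # grep -c prints the number of lines that contain a carriage return.
    # It exits non-zero when that number is 0, hence the "|| true".
    cr_before=$(grep -c "$cr" "$tmp" || true)

    # Strip the carriage returns, as in the tr command above.
    tr -d '\r' < "$tmp" > "$tmp.clean" && mv "$tmp.clean" "$tmp"

    cr_after=$(grep -c "$cr" "$tmp" || true)
    echo "lines with CR before: $cr_before, after: $cr_after"
    rm -f "$tmp"
    ```

    A non-zero count before the cleanup means the file has Windows-style line endings.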
    

    Printing the array with the loop shown above gives the following output:

    /path/to/fastq/DG13.fastq.gz
    /path/to/fastq/DG14.fastq.gz
    

    Explanation of awk code:

    awk '                                 ##Start the awk program.
    {gsub(/\r/,"")}                       ##Remove every carriage return (Control-M) from the current line.
    match($NF,/\/[^"]*/){                 ##In the last field, match from the first / up to (but not including) the next ".
      val=substr($NF,RSTART,RLENGTH)      ##Set val to the matched substring, which starts at RSTART and is RLENGTH characters long.
      num=split(val,array,"/")            ##Split val on / into array; num is the number of pieces.
      print val"/"$1"."array[num]".gz"    ##Print val, /, the first field, a dot, the last element of array (here fastq), then .gz.
    }
    '  Input_file                         ##Name of the input file.
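
    One more thing to watch when picking an element by task ID: bash arrays are indexed from 0, while $SGE_TASK_ID starts at 1, so the index needs to be shifted by one. Below is a minimal sketch that puts this together with the isPE filter from the question. The sample table is rebuilt from the question's example, the names table and infiles are only illustrative, the default SGE_TASK_ID=1 is there just so the sketch runs outside SGE, and mapfile assumes bash 4 or newer.

    ```bash
    # Hypothetical sample table (tab separated, CRLF line endings), for illustration only.
    table=$(mktemp)
    printf 'SampleID\tFactor\tCondition\tReplicate\tTreatment\tType\tDataset\tisPE\tReadLength\tisREF\tPathFASTQ\r\n' > "$table"
    printf 'DG13\tfd3\tc1\t1\tcc\t0\t0102\t0\t50\t1\t"/path/to/fastq"\r\n' >> "$table"
    printf 'DG14\tfd3\tc1\t1\tcc\t1\t0102\t0\t50\t1\t"/path/to/fastq"\r\n' >> "$table"

    # Skip the header (NR > 1), drop carriage returns and double quotes,
    # keep rows where column 8 (isPE) is 0, and print path/SampleID.fastq.gz.
    mapfile -t infiles < <(
      awk -F'\t' 'NR > 1 { gsub(/\r|"/, "") } NR > 1 && $8 == 0 { print $11 "/" $1 ".fastq.gz" }' "$table"
    )

    # SGE task IDs start at 1, bash array indices at 0.
    SGE_TASK_ID=${SGE_TASK_ID:-1}   # default only so the sketch runs outside SGE
    infile=${infiles[SGE_TASK_ID - 1]}
    echo "$infile"
    rm -f "$table"
    ```

    Unlike the awk code above, this sketch hard-codes the .fastq.gz suffix instead of deriving it from the last path component.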