Tags: bash, database-design, tree, type-conversion, bioinformatics

Convert tree file to Table format using bash


I have this input (input.txt):

(((((((hg38:0.00390111,panTro4:0.00466345):0.0067608,ponAbe2:0.0116062):0.00867419,((((rheMac3:0.00199139,macFas5:0.00136397):0.00219754,papAnu2:0.00373049):0.00221139,chlSab2:0.00690005):0.00434788,(nasLar1:0.00415921,rhiRox1:0.00361872):0.0075705):0.0149329):0.0129667,(calJac3:0.025809,saiBol1:0.0245054):0.0316131):0.0521649,tarSyr2:0.113368):0.00737652,(micMur1:0.0695349,otoGar3:0.105996):0.0356137):0.00510281,mm10:0.304925);

I'm trying to convert it to a table using the code below, labeling a value "branch" when the number is not preceded by a species name (hg38, panTro4, ponAbe2, rheMac3, macFas5, papAnu2, chlSab2, nasLar1, rhiRox1, calJac3, saiBol1, tarSyr2, micMur1, otoGar3, mm10):

cat input.txt | tr ',' '\n' | tr ':' '\t' | tr -d '()' |
awk -F '\t' '{if (NF == 1) {printf "%s\t%s\n", "branch", $1} else {printf "%s\t%s\n", $1, $2}}'

However, I'm getting this output:

branch  hg380.00390111 
branch  panTro40.004663450.0067608 
branch  ponAbe20.01160620.00867419 
branch  rheMac30.00199139 
branch  macFas50.001363970.00219754 
branch  papAnu20.003730490.00221139 
branch  chlSab20.006900050.00434788 
branch  nasLar10.00415921 
branch  rhiRox10.003618720.00757050.01493290.0129667 
branch  calJac30.025809 
branch  saiBol10.02450540.03161310.0521649 
branch  tarSyr20.1133680.00737652 
branch  micMur10.0695349 
branch  otoGar30.1059960.03561370.00510281 
branch  mm100.304925 

For instance, in this part: (((((((hg38:0.00390111,panTro4:0.00466345):0.0067608...

The 0.0067608 should appear on its own row below the panTro4 value, labeled with what I call hg38-panTro4, rather than being run together with it as in the output above. The same should happen with the other branches. This is the desired output as a tab-delimited table:

hg38    0.00390111
panTro4 0.00466345
hg38-panTro4    0.0067608 
ponAbe2 0.0116062
hg38-ponAbe2    0.00867419 
rheMac3 0.00199139 
macFas5 0.00136397
rheMac3-macFas5 0.00219754 
papAnu2 0.00373049
rheMac3-papAnu2 0.00221139 
chlSab2 0.00690005
rheMac3-chlSab2 0.00434788 
nasLar1 0.00415921 
rhiRox1 0.00361872
nasLar1-rhiRox1 0.0075705
rheMac3-nasLar1 0.0149329
hg38-rheMac3    0.0129667 
calJac3 0.025809 
saiBol1 0.0245054
calJac3-saiBol1 0.0316131
hg38-calJac3    0.0521649 
tarSyr2 0.113368
hg38-tarSyr2    0.00737652 
micMur1 0.0695349 
otoGar3 0.105996
micMur1-otoGar3 0.0356137
hg38-micMur1    0.00510281 
mm10    0.304925 

Solution

  • Assumptions/understandings:

    • each line of input is a new 'tree' (aka a new table)
    • each line of input is guaranteed to have a matching number of left and right parens
    • each line of input contains no white space (ie, no spaces, no tabs)
    • a species only occurs once in a line; otherwise this answer may not generate the desired output
    • OP has access to GNU awk (aka gawk) so that we can make use of the 4th argument to the split() function
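The first three assumptions can be checked before any conversion is attempted. A minimal sketch, assuming a POSIX awk; the function name check_trees is made up for illustration. It only compares paren counts, so it does not verify nesting order or the one-occurrence-per-species assumption:

```shell
# check_trees: read tree lines on stdin and report lines that break the
# assumptions above. Hypothetical helper name, not part of the answer.
check_trees() {
    awk '
    {
        line = $0
        opens = gsub(/\(/, "", line)      # gsub returns the number of replacements
        line = $0
        closes = gsub(/\)/, "", line)
        if (opens != closes) {
            printf "line %d: %d opening vs %d closing parens\n", NR, opens, closes
            bad = 1
        }
        if ($0 ~ /[ \t]/) {
            printf "line %d: contains white space\n", NR
            bad = 1
        }
    }
    END { exit (bad + 0) }                # non-zero exit status if any line failed
    '
}
```

It could be run as check_trees < input.txt; no output and a zero exit status mean both checks passed on every line.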

    General approach:

    • output format will be a) <species> <nbr> or b) <species>-<branch> <nbr>
    • we'll use a pair of stacks (implemented as arrays spec[] and branch[]) to keep track of our species and branches; s and b will be our array indices, respectively
    • split each line on any of three delimiters: "(", ")" and ","; this leaves us with fields of the format a) <species>:<nbr> or b) :<nbr> or c) <empty> (the closing ; lands in a final field of its own, which the code ends up ignoring)
    • if a field has the format <species>:<nbr> then we print <species> <nbr> to stdout and then look at the previous delimiter ...
    • if the previous delimiter was a ( then we push <species> onto both arrays otherwise we only push <species> onto the branch[] array
    • if a field has the format :<nbr> then we print <species>-<branch> <nbr> (ie, current top of the two stacks/arrays == spec[s]-branch[b]) and then pop the current entry from the branch stack (ie, b--); however, there is one exception to this step ...
    • if the top of both stacks is the same species (ie, spec[s] == branch[b]) then we first pop the top off the species stack (ie, s--) before performing the previous print and pop-of-the-branch-stack
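To see what the split step produces, it can help to dump the fields next to the delimiter that follows each one. A small sketch on a made-up three-species tree, assuming gawk is installed (the 4th argument to split() is a gawk extension); show_split is a hypothetical name used only here:

```shell
# show_split: print every field produced by the split, together with the
# delimiter that follows it. Requires gawk for the 4-argument split().
show_split() {
    gawk '{
        n = split($0, arr, /[(),]/, seps)   # fields in arr[], delimiters in seps[]
        for (i = 1; i <= n; i++)
            printf "field %d: [%s]  next delimiter: [%s]\n", i, arr[i], (i < n ? seps[i] : "")
    }'
}

printf '%s\n' '((a:1,b:2):3,c:4);' | show_split
```

The leading parens produce empty fields, a species field such as a:1 is preceded by a "(" delimiter, and the branch field :3 has an empty species part; these are the three cases the approach above distinguishes.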

    One GNU awk (aka gawk) approach:

    awk '
    BEGIN { OFS="\t" }
          { delete spec                                  # init stack/array
            delete branch                                # init stack/array
            s=b=0                                        # init array indices
            print "########### new table"
            n=split($0,arr,"[(),]",seps)                 # split current line on triple delimiters, fields go into array arr[] while delimiters go into array seps[]
    
            for (i=1;i<=n;i++) {                         # loop through our fields
                split(arr[i],x,":")                      # split current field into two pieces: x[1]/species and x[2]/nbr
    
                if (! arr[i])                            # if current field is empty then skip to next field
                   continue
                else
                if (! x[1]) {                            # if species is empty => field looks like ":<nbr>" then ...
                   if (spec[s]==branch[b]) s--           # if top of both stacks is the same then pop the species stack
                   print spec[s] "-" branch[b], x[2]     # print our species-branch/nbr and ...
                   b--                                   # pop the branch stack
                }
                else
                if (seps[i-1] == "(") {                  # if previous delimiter was "(" then ...
                   spec[++s]=x[1]                        # push species onto both stacks
                   branch[++b]=x[1]
                   print x[1],x[2]                       # print current species/nbr to stdout
                }
                else
                if (seps[i-1] == ",") {                  # if previous delimiter was "," then ...
                   branch[++b]=x[1]                      # push species onto the branch stack
                   print x[1],x[2]                       # print current species/nbr to stdout
                }
            }
          }
    ' input.txt
    

    This generates:

    ########### new table
    hg38    0.00390111
    panTro4 0.00466345
    hg38-panTro4    0.0067608
    ponAbe2 0.0116062
    hg38-ponAbe2    0.00867419
    rheMac3 0.00199139
    macFas5 0.00136397
    rheMac3-macFas5 0.00219754
    papAnu2 0.00373049
    rheMac3-papAnu2 0.00221139
    chlSab2 0.00690005
    rheMac3-chlSab2 0.00434788
    nasLar1 0.00415921
    rhiRox1 0.00361872
    nasLar1-rhiRox1 0.0075705
    rheMac3-nasLar1 0.0149329
    hg38-rheMac3    0.0129667
    calJac3 0.025809
    saiBol1 0.0245054
    calJac3-saiBol1 0.0316131
    hg38-calJac3    0.0521649
    tarSyr2 0.113368
    hg38-tarSyr2    0.00737652
    micMur1 0.0695349
    otoGar3 0.105996
    micMur1-otoGar3 0.0356137
    hg38-micMur1    0.00510281
    mm10    0.304925
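If gawk is not available, a similar table can be produced without the 4-argument split() by scanning each line one character at a time. The following is a sketch under stated assumptions, not the method used above: it swaps the spec[]/branch[] stacks for one first[]/last[] pair per open paren, and tree_to_table is a made-up name. It assumes every species and every closing paren that should be reported is followed by :<nbr>, and it leaves out the "new table" separator line.

```shell
# tree_to_table: read tree lines on stdin, write <label><TAB><nbr> rows.
# Hypothetical helper; written for POSIX awk, no gawk extensions.
tree_to_table() {
    awk '
    BEGIN { OFS = "\t" }

    # Record species x as a child of the group that is open at depth d.
    function add_child(x, d) {
        if (d < 1) return
        if (first[d] == "") first[d] = x   # first species seen in this group
        last[d] = x                        # first species of the newest child
    }

    {
        depth = 0; tok = ""; pending = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (index("(),;", c) == 0) { tok = tok c; continue }

            # A delimiter ends the current token, so handle the token first.
            p = index(tok, ":")
            if (p > 0) {
                name = substr(tok, 1, p - 1)
                val = substr(tok, p + 1)
                if (pending != "") {
                    print pending, val     # number that follows a closing paren
                } else if (name != "") {
                    print name, val        # plain <species>:<nbr> field
                    add_child(name, depth)
                }
            }
            tok = ""; pending = ""

            if (c == "(") {
                depth++
                first[depth] = ""; last[depth] = ""
            } else if (c == ")" && depth > 0) {
                pending = first[depth] "-" last[depth]
                f = first[depth]
                depth--
                add_child(f, depth)        # closed group counts as a child of its parent
            }
        }
    }'
}
```

Run as tree_to_table < input.txt, this should reproduce the rows shown above for the sample input, minus the separator line. Each label is built from the first species inside the closed group and the first species of that group's last child, which is the same pairing the two stacks yield for this input.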