I have this input (input.txt):
(((((((hg38:0.00390111,panTro4:0.00466345):0.0067608,ponAbe2:0.0116062):0.00867419,((((rheMac3:0.00199139,macFas5:0.00136397):0.00219754,papAnu2:0.00373049):0.00221139,chlSab2:0.00690005):0.00434788,(nasLar1:0.00415921,rhiRox1:0.00361872):0.0075705):0.0149329):0.0129667,(calJac3:0.025809,saiBol1:0.0245054):0.0316131):0.0521649,tarSyr2:0.113368):0.00737652,(micMur1:0.0695349,otoGar3:0.105996):0.0356137):0.00510281,mm10:0.304925);
I'm trying to convert it to a table format using the code below, adding "branch" when a number is not preceded by a species (hg38, panTro4, ponAbe2, rheMac3, macFas5, papAnu2, chlSab2, nasLar1, rhiRox1, calJac3, saiBol1, tarSyr2, micMur1, otoGar3, mm10):
cat input.txt | tr ',' '\n' | tr ':' '\t' | tr -d '()' |
awk -F '\t' '{if (NF == 1) {printf "%s\t%s\n", "branch", $1} else {printf "%s\t%s\n", $1, $2}}'
However, I'm getting this output:
branch hg380.00390111
branch panTro40.004663450.0067608
branch ponAbe20.01160620.00867419
branch rheMac30.00199139
branch macFas50.001363970.00219754
branch papAnu20.003730490.00221139
branch chlSab20.006900050.00434788
branch nasLar10.00415921
branch rhiRox10.003618720.00757050.01493290.0129667
branch calJac30.025809
branch saiBol10.02450540.03161310.0521649
branch tarSyr20.1133680.00737652
branch micMur10.0695349
branch otoGar30.1059960.03561370.00510281
branch mm100.304925
For instance, in this part: (((((((hg38:0.00390111,panTro4:0.00466345):0.0067608...
The 0.0067608 should go on its own row below the panTro4 value, labelled with what I call hg38-panTro4 (rather than the generic "branch"). The same should happen with the other branches. This is the desired output, as a tab-delimited table:
hg38 0.00390111
panTro4 0.00466345
hg38-panTro4 0.0067608
ponAbe2 0.0116062
hg38-ponAbe2 0.00867419
rheMac3 0.00199139
macFas5 0.00136397
rheMac3-macFas5 0.00219754
papAnu2 0.00373049
rheMac3-papAnu2 0.00221139
chlSab2 0.00690005
rheMac3-chlSab2 0.00434788
nasLar1 0.00415921
rhiRox1 0.00361872
nasLar1-rhiRox1 0.0075705
rheMac3-nasLar1 0.0149329
hg38-rheMac3 0.0129667
calJac3 0.025809
saiBol1 0.0245054
calJac3-saiBol1 0.0316131
hg38-calJac3 0.0521649
tarSyr2 0.113368
hg38-tarSyr2 0.00737652
micMur1 0.0695349
otoGar3 0.105996
micMur1-otoGar3 0.0356137
hg38-micMur1 0.00510281
mm10 0.304925
Assumptions/understandings:
- each input line is processed as a separate tree (the code prints a header for each line)
- requires GNU awk (aka gawk) so that we can make use of the 4th argument to the split() function
General approach:
- output will consist of lines of the format a) <species> <nbr> or b) <species>-<branch> <nbr>
- we use a pair of arrays as stacks (spec[] and branch[]) to keep track of our species and branches; s and b will be our array indices, respectively
- we split each input line on the three delimiters "(", ")" and ","; this will leave us with fields of the format a) <species>:<nbr> or b) :<nbr> or c) <empty>
- if the field is <species>:<nbr> then we print <species> <nbr> to stdout and then look at the previous delimiter ...
  - if the previous delimiter is "(" then we push <species> onto both arrays, otherwise we only push <species> onto the branch[] array
- if the field is :<nbr> then we print <species>-<branch> <nbr> (ie, current top of the two stacks/arrays == spec[s]-branch[b]) and then pop the current entry from the branch stack (ie, b--); however, there is one exception to this step ...
  - if the top of both stacks is the same (ie, spec[s] == branch[b]) then we first pop the top off the species stack (ie, s--) before performing the previous print and pop-of-the-branch-stack
One GNU awk (aka gawk) approach:
awk '
BEGIN { OFS="\t" }
      { delete spec                                  # init stack/array
        delete branch                                # init stack/array
        s=b=0                                        # init array indices
        print "########### new table"
        n=split($0,arr,"[(),]",seps)                 # split current line on triple delimiters, fields go into array arr[] while delimiters go into array seps[]
        for (i=1;i<=n;i++) {                         # loop through our fields
            split(arr[i],x,":")                      # split current field into two pieces: x[1]/species and x[2]/nbr
            if (! arr[i])                            # if current field is empty then skip to next field
               continue
            else
            if (! x[1]) {                            # if species is empty => field looks like ":<nbr>" then ...
               if (spec[s]==branch[b]) s--           # if top of both stacks is the same then pop the species stack
               print spec[s] "-" branch[b], x[2]     # print our species-branch/nbr and ...
               b--                                   # pop the branch stack
            }
            else
            if (seps[i-1] == "(") {                  # if previous delimiter was "(" then ...
               spec[++s]=x[1]                        # push species onto both stacks
               branch[++b]=x[1]
               print x[1],x[2]                       # print current species/nbr to stdout
            }
            else
            if (seps[i-1] == ",") {                  # if previous delimiter was "," then ...
               branch[++b]=x[1]                      # push species onto the branch stack
               print x[1],x[2]                       # print current species/nbr to stdout
            }
        }
      }
' input.txt
This generates:
########### new table
hg38 0.00390111
panTro4 0.00466345
hg38-panTro4 0.0067608
ponAbe2 0.0116062
hg38-ponAbe2 0.00867419
rheMac3 0.00199139
macFas5 0.00136397
rheMac3-macFas5 0.00219754
papAnu2 0.00373049
rheMac3-papAnu2 0.00221139
chlSab2 0.00690005
rheMac3-chlSab2 0.00434788
nasLar1 0.00415921
rhiRox1 0.00361872
nasLar1-rhiRox1 0.0075705
rheMac3-nasLar1 0.0149329
hg38-rheMac3 0.0129667
calJac3 0.025809
saiBol1 0.0245054
calJac3-saiBol1 0.0316131
hg38-calJac3 0.0521649
tarSyr2 0.113368
hg38-tarSyr2 0.00737652
micMur1 0.0695349
otoGar3 0.105996
micMur1-otoGar3 0.0356137
hg38-micMur1 0.00510281
mm10 0.304925
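If gawk is not available, the same two-stack logic can probably be driven by a plain left-to-right scan instead of the 4-argument split(). What follows is only a minimal sketch under that assumption: newick_table is a hypothetical function name (not part of the code above), it relies on nothing beyond match(), substr() and the 2-argument split(), and it leaves out the "new table" header line.

```shell
#!/bin/sh
# newick_table is a hypothetical helper name, used here for illustration only.
# It reads Newick text on stdin and prints one "name<TAB>nbr" row per branch.
newick_table() {
    awk '
    BEGIN { OFS = "\t" }
    {
        s = b = 0                  # reset both stack indices for every input line
        prev = ""                  # delimiter that preceded the current field
        rest = $0                  # part of the line that is still unscanned
        while (rest != "") {
            if (match(rest, /[(),]/)) {            # next delimiter found
                field = substr(rest, 1, RSTART - 1)
                delim = substr(rest, RSTART, 1)
                rest  = substr(rest, RSTART + 1)
            } else {                               # no delimiter left: last field
                field = rest; delim = ""; rest = ""
            }
            if (field != "") {
                split(field, x, ":")               # x[1] = species, x[2] = nbr
                if (x[1] == "") {                  # field looks like ":<nbr>"
                    if (spec[s] == branch[b]) s--  # same top on both stacks: pop species
                    print spec[s] "-" branch[b], x[2]
                    b--                            # pop the branch stack
                } else if (prev == "(") {          # species right after "(": push on both stacks
                    spec[++s] = x[1]; branch[++b] = x[1]
                    print x[1], x[2]
                } else if (prev == ",") {          # species right after ",": push on branch stack only
                    branch[++b] = x[1]
                    print x[1], x[2]
                }
            }
            prev = delim                           # remember the delimiter for the next field
        }
    }'
}

# usage on a small made-up tree (species names a..e are placeholders)
printf '%s\n' '(((a:1,b:2):3,(c:4,d:5):6):7,e:8);' | newick_table
```

Called as newick_table < input.txt, this is meant to print the same rows as the gawk version, minus the header line. The scan keeps only the previous delimiter in a variable, which is all the stack logic needs from the seps[] array.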