Replace tip labels in Newick files using a reference list in bash


I have a collection of newick-formatted files containing gene IDs:

((gene1:1,gene2:1)100:1,gene3:1)100;
((gene4:1,gene5:1)100:1,gene6:1)100;

I have a list that maps each species name to its gene IDs:

speciesA=(gene1,gene4)
speciesB=(gene2,gene5)
speciesC=(gene3,gene6)

I would like to get the following output:

((speciesA:1,speciesB:1)100:1,speciesC:1)100;
((speciesA:1,speciesB:1)100:1,speciesC:1)100;

Any idea how I could proceed? Ideally a bash solution would be awesome :)


Solution

  • input.txt

    ((gene1:1,gene2:1)100:1,gene3:1)100;
    ((gene4:1,gene5:1)100:1,gene6:1)100;
    

    equivs.txt

    speciesA=(gene1,gene4)
    speciesB=(gene2,gene5)
    speciesC=(gene3,gene6)
    

    convert.sh

    #!/bin/bash
    
    #replace every gene ID in one tree line with its species name
    replace() {
        local output=$1 line rep targets regex
        #read equivs.txt line by line so whitespace cannot split an entry
        while IFS= read -r line
        do
            [ -z "$line" ] && continue
    
            #gets the replacement string (the text before '=')
            rep=${line%%=*}
    
            #create a regex of all the possible matches we want to replace with $rep
            #\b (GNU sed) stops gene1 from also matching inside gene10
            targets=${line#*\(}
            targets=${targets%%\)*}
            regex="\\b(${targets//,/|})\\b"
    
            #do the replacements
            output=$(printf '%s\n' "$output" | sed -r "s/${regex}/${rep}/g")
        done < equivs.txt
        printf '%s\n' "$output"
    }
    
    #step through the input file, calling the above function on each line.
    #assuming all lines are formatted like the example!
    while IFS= read -r line
    do
        replace "$line"
    done < input.txt
    

    output:

    ((speciesA:1,speciesB:1)100:1,speciesC:1)100;
    ((speciesA:1,speciesB:1)100:1,speciesC:1)100;
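
    A possible alternative, not part of the original answer: the loop above starts one sed process per mapping line for every tree line, which gets slow on large trees. The sketch below instead turns equivs.txt into a sed script once and applies it in a single pass. The function names build_sed_script and rename_tips are made up for illustration. It assumes GNU sed (for \b), bash (for process substitution), mapping lines shaped exactly like speciesA=(gene1,gene4), and IDs that contain no regex metacharacters or slashes.

```shell
#!/bin/bash

# Hypothetical helper: print one sed substitution per mapping line.
#   speciesA=(gene1,gene4)  ->  s/\b(gene1|gene4)\b/speciesA/g
# Lines that do not match the expected shape are skipped (-n ... p).
build_sed_script() {
    local mapping=$1
    sed -E -n \
        -e 's/,/|/g' \
        -e 's#^([^=]+)=\((.*)\)$#s/\\b(\2)\\b/\1/g#p' \
        "$mapping"
}

# Hypothetical helper: apply the generated script to the given tree files,
# or to standard input when no file is given.
rename_tips() {
    local mapping=$1
    shift
    sed -E -f <(build_sed_script "$mapping") "$@"
}
```

    With the files from the answer, rename_tips equivs.txt input.txt should print the same two species trees. Because rename_tips passes any further arguments straight to sed, several tree files can be given at once, or it can be called once per file in a loop when each tree needs its own output file.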