bash, awk, sed, non-greedy

How to delete multiple matches in bash


I have multiple files with the following structure:

(Genome1_Sample4A_protein_Genome1_Sample4A_132_2:0.0060449,(Genome1_Sample5A_protein_Genome1_Sample5A_30_12:1e-06,(Genome1_Sample1B_protein_Genome1_Sample1B_99_2:1e-06,Genome1_Sample6A_protein_Genome1_Sample6A_295_2:0.00366292)n2:0.00370314)n1:0.0060449)n0; 

In each of them, I would like to delete everything between "_protein" and the next ":". The output should be as follows:

(Genome1_Sample4A:0.0060449,(Genome1_Sample5A:1e-06,(Genome1_Sample1B:1e-06,Genome1_Sample6A:0.00366292)n2:0.00370314)n1:0.0060449)n0; 

I have tried to use sed and awk:

sed -i 's/_protein.*:/:/g' tree1.txt

sed -i 's/_protein.*_[[:digit:]]*:/:/g' tree1.txt

awk '{gsub(/\_protein*:/,":");}1' tree1.txt

But none of these commands gave me the desired output.


Solution

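  • The .* is greedy: it extends to the last ":" on the line, so everything from the first "_protein" up to that final colon is deleted. (In the awk attempt, protein* only means "protei" followed by zero or more "n", so it never matches.) Use [^:]*, which stops at the next ":", instead: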

    sed 's/_protein[^:]*:/:/g' tree1.txt
    

    Output:

    (Genome1_Sample4A:0.0060449,(Genome1_Sample5A:1e-06,(Genome1_Sample1B:1e-06,Genome1_Sample6A:0.00366292)n2:0.00370314)n1:0.0060449)n0;
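
Since the question mentions several files, here is a minimal sketch of how the same substitution might be applied to each of them, with the greedy pattern run alongside for comparison. The temporary directory, the tree2.txt name and the shortened sample line are assumptions made up for the demo; only the two sed patterns come from the thread. It writes through a temporary copy rather than sed -i, because the -i syntax differs between GNU and BSD sed.

```shell
# Hypothetical demo setup: a scratch directory with two small tree files.
workdir=$(mktemp -d)
sample='(Genome1_Sample4A_protein_Genome1_Sample4A_132_2:0.0060449,Genome1_Sample5A_protein_Genome1_Sample5A_30_12:1e-06)n0;'
printf '%s\n' "$sample" > "$workdir/tree1.txt"
printf '%s\n' "$sample" > "$workdir/tree2.txt"

# Greedy .* runs to the LAST colon on the line, so the first branch
# length and the whole second leaf name are swallowed.
greedy=$(sed 's/_protein.*:/:/g' "$workdir/tree1.txt")

# [^:]* stops at the first colon after each "_protein".
# Each file is rewritten via a temporary copy and then moved into place.
for f in "$workdir"/tree*.txt; do
    sed 's/_protein[^:]*:/:/g' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

fixed=$(cat "$workdir/tree1.txt")
fixed2=$(cat "$workdir/tree2.txt")

printf 'greedy: %s\n' "$greedy"
printf 'fixed:  %s\n' "$fixed"

# Remove the scratch directory used for the demo.
rm -rf "$workdir"
```

With this sample line, the greedy version prints (Genome1_Sample4A:1e-06)n0; while the fixed version prints (Genome1_Sample4A:0.0060449,Genome1_Sample5A:1e-06)n0;. On real data, replace "$workdir"/tree*.txt with a glob that matches your own files, and keep a backup until you have checked the result.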