bash sed substitution

How to substitute a variable-length sequence of the same character in bash?


I am quite sure this question already has an answer, but I can't find it, so if there is one, please link it in the comments.

Otherwise, the problem I need to solve is how to replace a run of the same character with a single character. The run may be one or more characters long, and the maximum length is not known. The goal is to organise the strings around a known delimiter.

Also, in my specific case I have to substitute the * character, but I can preprocess the file to replace it with an easier-to-handle character first.

My current attempt is a poor solution because it assumes that the maximum length of the run is known, which of course it is not:

cat example_file.txt | sed 's/\*\*\*\*\*\*\*\*/_/g' | sed 's/\*\*\*\*\*\*\*/_/g' | sed 's/\*\*\*\*\*\*/_/g' | sed 's/\*\*\*\*\*/_/g' | sed 's/\*\*\*\*/_/g' | sed 's/\*\*\*/_/g' | sed 's/\*\*/_/g' | sed 's/\*/_/g' > clean_file.txt

where example_file.txt contains something like:

>SH1111056.09FU|KC881085_refs|k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__Clavicipitaceae;g__Neotyphodium;s__Neotyphodium_siegelii;|foliar_endophyte*litter_saprotroph*class1_clavicipitaceous_endophyte**leaf/fruit/seed**non-aquatic*arthropod-associated*filamentous_mycelium******
>SH1115797.09FU|UDB031565_refs|k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Hymenochaetales;f__Hymenochaetaceae;g__Fomitiporia;s__Fomitiporia_hippophaeicola;|plant_pathogen*wood_saprotroph**wood_pathogen*wood*white_rot*non-aquatic**filamentous_mycelium*polyporoid*poroid****
>SH0879139.09FU|KF945456|k__Viridiplantae;p__Anthophyta;c__Eudicotyledonae;o__Lamiales;f__Acanthaceae;g__Ruellia;s__Ruellia_brandbergensis;|ND**************
>SH0991532.09FU|UDB07658019|k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Venturiales;f__Venturiaceae;g__Sympodiella;s__Sympodiella_sp;|litter_saprotroph****leaf/fruit/seed**non-aquatic**filamentous_mycelium******
>SH0991546.09FU|UDB07657573|k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Venturiales;f__Venturiaceae;g__Sympodiella;s__Sympodiella_sp;|litter_saprotroph****leaf/fruit/seed**non-aquatic**filamentous_mycelium******

EDIT:

The expected output, assuming * is substituted with _, would be this:

>SH1111056.09FU|KC881085_refs|k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__Clavicipitaceae;g__Neotyphodium;s__Neotyphodium_siegelii;|foliar_endophyte_litter_saprotroph_class1_clavicipitaceous_endophyte_leaf/fruit/seed_non-aquatic_arthropod-associated_filamentous_mycelium_
>SH1115797.09FU|UDB031565_refs|k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Hymenochaetales;f__Hymenochaetaceae;g__Fomitiporia;s__Fomitiporia_hippophaeicola;|plant_pathogen_wood_saprotroph_wood_pathogen_wood_white_rot_non-aquatic_filamentous_mycelium_polyporoid_poroid_
>SH0879139.09FU|KF945456|k__Viridiplantae;p__Anthophyta;c__Eudicotyledonae;o__Lamiales;f__Acanthaceae;g__Ruellia;s__Ruellia_brandbergensis;|ND_
>SH0991532.09FU|UDB07658019|k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Venturiales;f__Venturiaceae;g__Sympodiella;s__Sympodiella_sp;|litter_saprotroph_leaf/fruit/seed_non-aquatic_filamentous_mycelium_
>SH0991546.09FU|UDB07657573|k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Venturiales;f__Venturiaceae;g__Sympodiella;s__Sympodiella_sp;|litter_saprotroph_leaf/fruit/seed_non-aquatic_filamentous_mycelium_

Solution

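  • Use tr with the -s (squeeze-repeats) option. It translates every * to _ and then squeezes each run of repeated _ into a single character: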

    tr -s '*' '_' < example_file.txt > clean_file.txt  
    

    or

    cat example_file.txt | tr -s '*' '_' > clean_file.txt
    

    The output (runs of _ already present in the input, such as k__Fungi, are squeezed as well):

    >SH1111056.09FU|KC881085_refs|k_Fungi;p_Ascomycota;c_Sordariomycetes;o_Hypocreales;f_Clavicipitaceae;g_Neotyphodium;s_Neotyphodium_siegelii;|foliar_endophyte_litter_saprotroph_class1_clavicipitaceous_endophyte_leaf/fruit/seed_non-aquatic_arthropod-associated_filamentous_mycelium_
    >SH1115797.09FU|UDB031565_refs|k_Fungi;p_Basidiomycota;c_Agaricomycetes;o_Hymenochaetales;f_Hymenochaetaceae;g_Fomitiporia;s_Fomitiporia_hippophaeicola;|plant_pathogen_wood_saprotroph_wood_pathogen_wood_white_rot_non-aquatic_filamentous_mycelium_polyporoid_poroid_
    >SH0879139.09FU|KF945456|k_Viridiplantae;p_Anthophyta;c_Eudicotyledonae;o_Lamiales;f_Acanthaceae;g_Ruellia;s_Ruellia_brandbergensis;|ND_
    >SH0991532.09FU|UDB07658019|k_Fungi;p_Ascomycota;c_Dothideomycetes;o_Venturiales;f_Venturiaceae;g_Sympodiella;s_Sympodiella_sp;|litter_saprotroph_leaf/fruit/seed_non-aquatic_filamentous_mycelium_
    >SH0991546.09FU|UDB07657573|k_Fungi;p_Ascomycota;c_Dothideomycetes;o_Venturiales;f_Venturiaceae;g_Sympodiella;s_Sympodiella_sp;|litter_saprotroph_leaf/fruit/seed_non-aquatic_filamentous_mycelium_
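
If the double underscores must survive, as in the expected output in the question, one option is to collapse only the runs of *. The following is a minimal sketch, assuming a POSIX sed; the function name squeeze_stars is hypothetical and used only for illustration:

```shell
# squeeze_stars: hypothetical helper that reads stdin and replaces each run
# of one or more '*' with a single '_', leaving all other characters alone.
squeeze_stars() {
    # Basic regular expression: '\*' is a literal star, and '\**' means
    # "zero or more further literal stars", so the whole run is matched once.
    sed 's/\*\**/_/g'
}

# Small inline sample: existing '__' is kept, runs of '*' become one '_'.
printf '%s\n' 'k__Fungi;|ND***a*b**' | squeeze_stars
# prints: k__Fungi;|ND_a_b_

# Usage on the files from the question (not run here):
#   squeeze_stars < example_file.txt > clean_file.txt
```

A similar effect can probably be had with tr alone by squeezing first and translating second (tr -s '*' | tr '*' '_'), since tr -s with a single set only squeezes the characters in that set.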