bash shell parsing text cut

An alternative: cut -d <string>?


When I type ls I get:

aedes_aegypti_upstream_dremeready_all_simpleMasked_random.fasta
anopheles_albimanus_upstream_dremeready_all_simpleMasked_random.fasta
anopheles_arabiensis_upstream_dremeready_all_simpleMasked_random.fasta
anopheles_stephensi_upstream_dremeready_all_simpleMasked_random.fasta
culex_quinquefasciatus_upstream_dremeready_all_simpleMasked_random.fasta

I want to pipe this into cut (or via some alternative way) so that I only get:

aedes_aegypti
anopheles_albimanus
anopheles_arabiensis
anopheles_stephensi
culex_quinquefasciatus

If cut accepted a multi-character string as its delimiter, I could use:

cut -d "_upstream_" -f1

But that is not permitted, as cut only takes a single character as its delimiter.


Solution

  • awk does allow a string as delimiter:

    $ awk -F"_upstream_" '{print $1}' file
    aedes_aegypti
    anopheles_albimanus
    anopheles_arabiensis
    anopheles_stephensi
    culex_quinquefasciatus
    drosophila_melanogaster
    

    Note that for the given input you can also use cut with _ as the delimiter and print the first two fields:

    $ cut -d'_' -f-2 file
    aedes_aegypti
    anopheles_albimanus
    anopheles_arabiensis
    anopheles_stephensi
    culex_quinquefasciatus
    drosophila_melanogaster
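
    The cut shortcut only works while every name has exactly two underscore-separated tokens before _upstream_. A minimal sketch of where it diverges from awk, assuming a made-up three-token name (anopheles_gambiae_pest_... is hypothetical and not part of the listing above):

    ```shell
    # Hypothetical sample input: the second name has three tokens
    # before "_upstream_".
    sample() {
      printf '%s\n' \
        'aedes_aegypti_upstream_dremeready_all_simpleMasked_random.fasta' \
        'anopheles_gambiae_pest_upstream_dremeready_all_simpleMasked_random.fasta'
    }

    # cut keeps only the first two "_"-separated fields, so "_pest" is lost.
    with_cut=$(sample | cut -d'_' -f-2)

    # awk splits on the whole string "_upstream_", so the full prefix survives.
    with_awk=$(sample | awk -F'_upstream_' '{print $1}')

    printf 'cut:\n%s\nawk:\n%s\n' "$with_cut" "$with_awk"
    ```

    If the prefixes can vary in length, splitting on the full _upstream_ string is the safer choice.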
    

    sed and grep can also do it. For example, this grep (the -P option requires GNU grep with PCRE support) uses a look-ahead to print everything from the beginning of the line up to, but not including, _upstream:

    $ grep -Po '^\w*(?=_upstream)' file
    aedes_aegypti
    anopheles_albimanus
    anopheles_arabiensis
    anopheles_stephensi
    culex_quinquefasciatus
    drosophila_melanogaster
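
    sed is mentioned above without an example. A minimal sketch of what it could look like, plus a variant that skips ls and the file entirely by trimming each name with shell parameter expansion. The function name strip_upstream and the *_upstream_*.fasta glob are assumptions made for illustration:

    ```shell
    # Delete everything from the first "_upstream_" to the end of each line.
    strip_upstream() {
      sed 's/_upstream_.*//'
    }

    # Usage with the same input file as in the examples above:
    #   strip_upstream < file

    # Alternative without any external tool: let the shell expand the glob
    # and trim each name. ${f%%_upstream_*} removes the longest suffix
    # that matches the pattern "_upstream_*".
    for f in *_upstream_*.fasta; do
      [ -e "$f" ] || continue   # skip the unexpanded pattern if nothing matches
      printf '%s\n' "${f%%_upstream_*}"
    done
    ```

    The loop variant avoids parsing ls output, which matters if file names can contain spaces or newlines.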