python fasta nano

How to output only unique gene IDs?


I am working on a project using the following script, which I edit in nano:

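from Bio import SeqIO
import sys
import re

# Path to the FASTA file is passed as the first command-line argument
fasta_file = sys.argv[1]
for myfile in SeqIO.parse(fasta_file, "fasta"):
    # Only consider records longer than 250 residues
    if len(myfile) > 250:
        gene_id = myfile.id
        mylist = re.match(r"H149xcV_\w+_\w+_\w+", gene_id)
        print(">" + mylist.group(0))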

and it produces the following output:

    >H149xcV_Fge342_r3_h2_d1
    >H149xcV_bTr423_r3_h2_d1
    >H149xcV_kN893_r3_h2_d1
    >H149xcV_DNp021_r3_h2_d1
    >H149xcV_JEP3324_r3_h2_d1
    >H149xcV_JEP3324_r3_h2_d1
    >H149xcV_JEP3324_r3_h2_d1
    >H149xcV_JEP3324_r3_h2_d1
    >H149xcV_SRt424234_r3_h2_d1
    >H149xcV_SRt424234_r3_h2_d1
    >H149xcV_SRt424234_r3_h2_d1
    >H149xcV_SRt424234_r3_h2_d1

How can I change my script so that it prints only UNIQUE IDs, without the trailing _d1 suffix:

    >H149xcV_Fge342_r3_h2
    >H149xcV_bTr423_r3_h2
    >H149xcV_kN893_r3_h2
    >H149xcV_DNp021_r3_h2
    >H149xcV_JEP3324_r3_h2
    >H149xcV_SRt424234_r3_h2

Solution

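  • You could use a capturing group and print that group instead of the whole match.

    Note that \w also matches the underscore, so \w+ can run across several fields. To prevent unnecessary backtracking, you can exclude the underscore from the word characters using a negated character class [^\W_]+. The IDs shown have four underscore-separated fields after the prefix, so capture the first three and leave the last one (_d1) outside the group:

    (H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)_[^\W_]+

    match = re.match(r"(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)_[^\W_]+", gene_id)
    print(">" + match.group(1))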
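Trimming the suffix alone still prints an ID every time it occurs, so repeated records would show up more than once. A minimal sketch of one way to print each trimmed ID only once is to remember the IDs already seen in a set. The names GENE_ID_RE and unique_gene_ids are hypothetical, the pattern assumes IDs shaped like the sample output in the question (prefix, three fields, then a final field to drop), and skipping IDs that do not match is an assumption rather than something the question asks for.

```python
import re

# Capture the prefix plus three underscore-free fields; the final field
# (e.g. "_d1") is matched but left outside the capturing group.
GENE_ID_RE = re.compile(r"(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)_[^\W_]+")


def unique_gene_ids(ids):
    """Yield each trimmed gene ID once, in order of first appearance."""
    seen = set()
    for gene_id in ids:
        match = GENE_ID_RE.match(gene_id)
        if match is None:
            continue  # assumption: silently skip IDs that do not fit the pattern
        trimmed = match.group(1)
        if trimmed not in seen:
            seen.add(trimmed)
            yield trimmed


if __name__ == "__main__":
    # Sample IDs copied from the output shown in the question
    sample = [
        "H149xcV_Fge342_r3_h2_d1",
        "H149xcV_JEP3324_r3_h2_d1",
        "H149xcV_JEP3324_r3_h2_d1",
    ]
    for gene_id in unique_gene_ids(sample):
        print(">" + gene_id)
```

In the original script, the input could be the record IDs from SeqIO.parse, for example unique_gene_ids(rec.id for rec in SeqIO.parse(fasta_file, "fasta") if len(rec) > 250). A set gives constant-time membership checks, and yielding in the loop keeps the IDs in the order they first appear in the file.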