Search code examples
pythonregexshellbioinformaticsfasta

Receiving NameError - how to fix?


I am working on a project and am having issues with the following code that I have written in nano:

from Bio import SeqIO
import sys
import re 


    fasta_file = (sys.argv[1])
    for myfile in SeqIO.parse(fasta_file, "fasta"):
      if len(myfile) > 250:
       gene_id = myfile.id
       list = re.match('H149xcV\_\w+\_\w+\_\w+', gene_id)
       print (">"+list.group(1)) 

This is the error I receive when I execute my command on command-line:

File "mpo.py", line 7, in <module>
    gene_id = myfile.id
NameError: name 'myfile' is not defined

I have a fasta file with the format

>H149xcV_Fge342_r3_h2_d1 len=210 path=[0:0-206]
ACTATACATGAGGAGAACATAGAACAAAAATGGGACCATAGATATATAACAATAGAAGATATAGAGAACACAATAGACAACTTATTAGGAAAGAGGTGTGTCGTCATGGAGCTGATGTTCGAGGATACTTTGCATGGTCATTCTTGGATAATTTTGAGTGGGCTATGGGATACACCAAGAGGTTTGGCATTGTTTATGTTGATTATAAGAACGGGC

 >H149xcV_ytR1oP_r3_h2_d1 len=306 path=[0:0-207]
    ATTAGAGTCTGAGAGAGTCTTGATTTGTCGTCGTCGAGAAATATAGGAGATCTGATTAGAGGAGAGAGCGGCCTAGGCGATGCGCGATATAGCGCTATATAGGCCTAGAGGAGAGTCTCTCTCTTTTAGAAGAGATAATATATATATATATATGGCTCTCCGGCGGGGCCGCGCGAGAGCTCGATCGATCGATATTAGCTGTACGATGCTAGCTAGCTTATATTCGATCGATTATAGCTTAGATCTCTCTCTAAAGGTCGATATCGCTTATGCGCGCGTATATCG

I would like to reformat my file so that it only provides me with the unique gene id's and only output those genes id's with a length greater than 250 bp.

I would like my desired output to look like this:

>H149xcV_Fge342_r3_h2
>H149xcV_ytR1oP_r3_h2
>H149xcV_DPN78333_r3_h2
>H149xcV_AgV472_r3_h2
>H149xcV_DNP733_r3_h2

Solution

  • As suggested in the comments following your question, the parameter to match should be a string. The one thing I'll add is that python3 has a r"" string delimiter for regular expressions. Your code becomes this:

    from Bio import SeqIO
    import sys
    import re 
    
    
        fasta_file = (sys.argv[1])
        for myfile in SeqIO.parse(fasta_file, "fasta"):
          if len(myfile) > 250:
           gene_id = myfile.id
           list = re.match(r"H149xcV_\w+_\w+_\w+", gene_id)
           print (">"+list.group(0)) 
    

    The underscore _ is not a special regular expression character (as I recall) so it doesn't need to be escaped.

    The match() function takes a regex and the string you are searching (so I added gene_id). Lastly, you want to output group(0). group(0) means the whole match. group(1) is from the first capturing paren (of which you have none) so stick with group(0).