I am working on a project and am having issues with the following code that I have written in nano
:
from Bio import SeqIO
import sys
import re
fasta_file = (sys.argv[1])
for myfile in SeqIO.parse(fasta_file, "fasta"):
if len(myfile) > 250:
gene_id = myfile.id
list = re.match('H149xcV\_\w+\_\w+\_\w+', gene_id)
print (">"+list.group(1))
This is the error I receive when I execute my command on command-line
:
File "mpo.py", line 7, in <module>
gene_id = myfile.id
NameError: name 'myfile' is not defined
I have a fasta file with the format
>H149xcV_Fge342_r3_h2_d1 len=210 path=[0:0-206]
ACTATACATGAGGAGAACATAGAACAAAAATGGGACCATAGATATATAACAATAGAAGATATAGAGAACACAATAGACAACTTATTAGGAAAGAGGTGTGTCGTCATGGAGCTGATGTTCGAGGATACTTTGCATGGTCATTCTTGGATAATTTTGAGTGGGCTATGGGATACACCAAGAGGTTTGGCATTGTTTATGTTGATTATAAGAACGGGC
>H149xcV_ytR1oP_r3_h2_d1 len=306 path=[0:0-207]
ATTAGAGTCTGAGAGAGTCTTGATTTGTCGTCGTCGAGAAATATAGGAGATCTGATTAGAGGAGAGAGCGGCCTAGGCGATGCGCGATATAGCGCTATATAGGCCTAGAGGAGAGTCTCTCTCTTTTAGAAGAGATAATATATATATATATATGGCTCTCCGGCGGGGCCGCGCGAGAGCTCGATCGATCGATATTAGCTGTACGATGCTAGCTAGCTTATATTCGATCGATTATAGCTTAGATCTCTCTCTAAAGGTCGATATCGCTTATGCGCGCGTATATCG
I would like to reformat my file so that it only provides me with the unique gene id's and only output those genes id's with a length greater than 250 bp.
I would like my desired output to look like this:
>H149xcV_Fge342_r3_h2
>H149xcV_ytR1oP_r3_h2
>H149xcV_DPN78333_r3_h2
>H149xcV_AgV472_r3_h2
>H149xcV_DNP733_r3_h2
As suggested in the comments following your question, the parameter to match should be a string. The one thing I'll add is that python3 has a r"" string delimiter for regular expressions. Your code becomes this:
from Bio import SeqIO
import sys
import re
fasta_file = (sys.argv[1])
for myfile in SeqIO.parse(fasta_file, "fasta"):
if len(myfile) > 250:
gene_id = myfile.id
list = re.match(r"H149xcV_\w+_\w+_\w+", gene_id)
print (">"+list.group(0))
The underscore _ is not a special regular expression character (as I recall) so it doesn't need to be escaped.
The match()
function takes a regex and the string you are searching (so I added gene_id). Lastly, you want to output group(0)
. group(0) means the whole match. group(1) is from the first capturing paren (of which you have none) so stick with group(0).