
Parsing BLAST XML output with re


I'm trying to parse BLAST output in XML format using the re module. I have never done this before; my code is below.

However, some hits contain more than one Hsp_num, so I get more values for query_from and query_to than for query_len. How can I make the script print query_len again whenever a hit has more than one Hsp_num? Thank you.

import re

# Write one tab-separated line per HSP: Hsp_num, query_from, query_to
with open('file.xml', 'r') as xml, open('result.txt', 'w') as output:
    for line in xml:
        if re.search('<Hsp_query-from>', line) is not None:
            line = line.strip()
            # str.strip(chars) removes any of these characters from both
            # ends, not the literal tag
            line = line.strip('<Hsp_query-from>')
            line = line.rstrip('</')
            query_from = line
        if re.search('<Hsp_query-to>', line) is not None:
            line = line.strip()
            line = line.strip('<Hsp_query-to>')
            line = line.rstrip('</')
            query_to = line
        if re.search('<Hsp_num>', line) is not None:
            line = line.strip()
            line = line.strip('<Hsp_num>')
            line = line.rstrip('</')
            Hsp_num = line
            output.write(Hsp_num + '\t' + query_from + '\t' + query_to + '\n')

I did query_len in a separate file, since it didn't work in the script above:

import re

with open('file.xml', 'r') as xml:
    for line in xml:
        if re.search('<Iteration_query-len>', line) is not None:
            line = line.strip()
            line = line.strip('<Iteration_query-len>')
            line = line.rstrip('</')
            # query_len appears once per query (Iteration), not once per HSP
            query_len = line

Solution

  • Are you familiar with Biopython? Its Bio.Blast.NCBIXML module may be just what you need. Chapter 7 of the Tutorial and Cookbook is all about BLAST, and section 7.3 deals with parsing. You'll get an idea of how it works, and it will be a lot easier than using regex to parse XML, which will only lead to tears and mental breakdowns.
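
If installing Biopython is not an option, the same idea (use a real XML parser instead of regex) can be sketched with the standard library. The example below is not the Biopython API the answer refers to; it swaps in xml.etree.ElementTree. It assumes the usual BLAST XML nesting, where each Iteration holds one Iteration_query-len and its HSPs sit somewhere below it. The sample document and the helper name hsp_rows are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the BLAST XML layout (values are made up).
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-len>250</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_query-from>1</Hsp_query-from>
              <Hsp_query-to>120</Hsp_query-to>
            </Hsp>
            <Hsp>
              <Hsp_num>2</Hsp_num>
              <Hsp_query-from>130</Hsp_query-from>
              <Hsp_query-to>240</Hsp_query-to>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def hsp_rows(root):
    """Yield (Hsp_num, query_from, query_to, query_len) for every HSP."""
    for iteration in root.iter('Iteration'):
        # One query length per Iteration; reuse it for each HSP below it.
        query_len = iteration.findtext('Iteration_query-len')
        for hsp in iteration.iter('Hsp'):
            yield (
                hsp.findtext('Hsp_num'),
                hsp.findtext('Hsp_query-from'),
                hsp.findtext('Hsp_query-to'),
                query_len,
            )


rows = list(hsp_rows(ET.fromstring(SAMPLE_XML)))
for row in rows:
    print('\t'.join(row))
```

Because query_len is looked up once per Iteration and then attached to every HSP inside it, a hit with several HSPs gets the length repeated on each output line. For a real file, something like hsp_rows(ET.parse('file.xml').getroot()) should work the same way, assuming the file follows this layout.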