python xml xml-parsing iterparse

Best practices for iterparse usage while keeping the context?


Following a question I asked on general iterparse usage (and its answer by J F Sebastian), I will reorganise my code to parse Nessus XML result files. Quoting from the earlier question, the file structure is:

<ReportHost host="host1">
  <ReportItem id="100">
    <foo>9.3</foo>
    <bar>hello</bar>
  </ReportItem>
  <ReportItem id="200">
    <foo>10.0</foo>
    <bar>world</bar>
  </ReportItem>
</ReportHost>
<ReportHost host="host2">
   ...
</ReportHost>

In other words, there are many hosts (ReportHost), each with many items to report (ReportItem), and each item has several characteristics (foo, bar). I want to generate one line per item, with its characteristics:

host1,id="100",foo="9.3",bar="hello"
host1,id="200",foo="10.0",bar="world"
host2,...

I understand how to extract given fields from the XML file (this is in essence the answer to my previous question). I need to keep these extracted fields in context, that is, I need to know which ReportHost and ReportItem they relate to. My idea was to use a marker: a variable that tells me whether I am in a ReportHost or a ReportItem block, so that I can decide from there (if inReportHost: ...). I fear that this is not the proper way to navigate XML with iterparse, though.

Is there a "best practices" document that covers this?

EDIT: improved example following comments


Solution

  • When you iterate over items with etree.iterparse() and detect them on end events, you have to keep the intermediate elements until you can show which host they belong to.

    In your example, the first two ReportItem elements to complete are <ReportItem id="100"> and <ReportItem id="200">; the end event for <ReportHost host="host1"> comes only after them. That is the point at which you can combine the preserved item data with the ReportHost details and print everything at once.

    Another way would be to parse the document twice: a first pass to collect the ReportHost data and a second pass to print each ReportItem's details.

    You can profile these methods to find which one suits you best.
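
The first approach (buffering items until their ReportHost ends) could look roughly like the sketch below. It is only an illustration, not a definitive implementation: it uses the standard library's xml.etree.ElementTree, and the sample data, the <Report> wrapper element (added so the sample has a single root) and the iter_lines helper are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data. The <Report> wrapper is an assumption,
# added only so that the document has a single root element.
SAMPLE = b"""<Report>
  <ReportHost host="host1">
    <ReportItem id="100"><foo>9.3</foo><bar>hello</bar></ReportItem>
    <ReportItem id="200"><foo>10.0</foo><bar>world</bar></ReportItem>
  </ReportHost>
  <ReportHost host="host2">
    <ReportItem id="300"><foo>1.0</foo><bar>again</bar></ReportItem>
  </ReportHost>
</Report>"""


def iter_lines(source):
    """Yield one line per ReportItem, prefixed with the host it belongs to."""
    pending = []  # items completed since the last ReportHost ended
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "ReportItem":
            # The item is complete here, but its host has not ended yet.
            fields = ['id="%s"' % elem.get("id")]
            fields += ['%s="%s"' % (child.tag, child.text) for child in elem]
            pending.append(",".join(fields))
        elif elem.tag == "ReportHost":
            # The host has ended: emit everything buffered for it.
            host = elem.get("host")
            for item in pending:
                yield "%s,%s" % (host, item)
            pending = []
            elem.clear()  # drop the finished host's children to save memory


for line in iter_lines(io.BytesIO(SAMPLE)):
    print(line)
```

Memory use is bounded by the largest single ReportHost rather than by the whole file, because each host is cleared once its lines have been emitted. If even that is too much, listening for start events as well is an option: an element's attributes are already available at its start event, so the current host name could be remembered there and each item written out as soon as it ends.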