Tags: python, xml, performance, large-files, expat-parser

What is the most efficient way of extracting information from a large number of XML files in Python?


I have a directory full of XML files (~10^3 to 10^4 of them) from which I need to extract the contents of several fields. I've tested different XML parsers, and since I don't need to validate the contents (which is expensive), I was thinking of simply using xml.parsers.expat (the fastest one) to go through the files one by one and extract the data.
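
Roughly what I have in mind (just a sketch; the element names here are placeholders, my real fields differ):

```python
import xml.parsers.expat

# Placeholder field names -- the real ones differ.
WANTED = {"title", "author", "date"}

def extract_fields(path):
    """Pull the text of a few elements out of one XML file with expat."""
    fields = {}
    current = None  # name of the wanted element we are currently inside

    def start_element(name, attrs):
        nonlocal current
        if name in WANTED:
            current = name
            fields[name] = ""

    def end_element(name):
        nonlocal current
        if name == current:
            current = None

    def char_data(data):
        if current is not None:
            fields[current] += data

    # A fresh parser per file for now (part of what question 2 is about).
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data
    with open(path, "rb") as f:
        parser.ParseFile(f)
    return fields
```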

  1. Is there a more efficient way? (simple text matching doesn't work)
  2. Do I need to issue a new ParserCreate() for each new file (or string) or can I reuse the same one for every file?
  3. Any caveats?

Thanks!


Solution

  • The quickest way would be to match strings (e.g., with regular expressions) instead of parsing XML; depending on how regular your XML files are, this could actually work (see the sketch after this list).

    But the most important thing is this: instead of thinking through several options, just implement them and time them on a small set of files. This will take roughly the same amount of time, and it will give you real numbers to drive you forward.

    EDIT:

    • Are the files on a local drive or network drive? Network I/O will kill you here.
    • The problem parallelizes trivially: you can split the work among several computers (or several processes on a multicore computer); see the multiprocessing sketch below.
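
Here is a sketch of the first bullet: regex-based extraction plus a crude timing harness for comparing approaches on a small sample. The `<title>` tag, the `data/*.xml` path, and the sample size are assumptions; swap in your own.

```python
import glob
import re
import time

# Hypothetical example: grab the text of <title> elements without parsing.
# This only works if the files are regular enough (no attributes on the tag,
# no CDATA tricks, no nested markup inside it).
TITLE_RE = re.compile(rb"<title>(.*?)</title>", re.DOTALL)

def extract_with_regex(path):
    with open(path, "rb") as f:
        return TITLE_RE.findall(f.read())

def time_approach(func, paths):
    """Run one extraction function over a sample of files and return the wall time."""
    start = time.perf_counter()
    for p in paths:
        func(p)
    return time.perf_counter() - start

sample = glob.glob("data/*.xml")[:500]   # small, representative subset
print("regex:", time_approach(extract_with_regex, sample))
# print("expat:", time_approach(extract_with_expat, sample))  # compare your parser version here
```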
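
And a minimal sketch of the multiprocessing variant from the last bullet, again assuming the files live under `data/*.xml` and reusing the placeholder regex; replace the body of the worker with whichever extraction approach wins the timing test.

```python
import glob
import re
from multiprocessing import Pool

TITLE_RE = re.compile(rb"<title>(.*?)</title>", re.DOTALL)

def process_one(path):
    # Replace this body with the extraction approach that wins the timing test.
    with open(path, "rb") as f:
        return path, TITLE_RE.findall(f.read())

if __name__ == "__main__":
    paths = glob.glob("data/*.xml")      # assumed location of the files
    with Pool() as pool:                 # defaults to one worker per CPU core
        for path, fields in pool.imap_unordered(process_one, paths, chunksize=50):
            pass  # collect or write the extracted fields here
```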