python xml parsing indexing sax

Is there a fast XML parser in Python that allows me to get start of tag as byte offset in stream?


I am working with potentially huge XML files containing complex trace information from one of my projects.

I would like to build indexes for those XML files so that one can quickly find subsections of the XML document without having to load it all into memory.

If I have created a "shelve" index that contains information such as "books for author Joe are at offsets [22322, 35446, 54545]", then I can open the XML file like a regular file, seek to those offsets, and hand the data found there to one of the DOM parsers that takes a file or a string.
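To make the idea concrete, the lookup side might be sketched as follows (Python 3). The key format `author:Joe`, the helper names `load_fragments` and `demo`, and the choice to store a length next to each offset are assumptions for illustration; the length is there so that exactly one element is read and handed to the DOM parser. The index is built by hand here, since building it automatically is the open part of the question.

```python
import os
import shelve
import tempfile
from xml.dom import minidom

XML = (b'<books>'
       b'<book author="Joe"><title>A</title></book>'
       b'<book author="Ann"><title>B</title></book>'
       b'<book author="Joe"><title>C</title></book>'
       b'</books>')


def load_fragments(xml_path, index_path, key):
    """Seek to each indexed (offset, length) span and parse only that span."""
    docs = []
    with shelve.open(index_path) as index, open(xml_path, 'rb') as f:
        for offset, length in index.get(key, []):
            f.seek(offset)
            docs.append(minidom.parseString(f.read(length)))
    return docs


def demo():
    with tempfile.TemporaryDirectory() as tmpdir:
        xml_path = os.path.join(tmpdir, 'books.xml')
        index_path = os.path.join(tmpdir, 'books.idx')
        with open(xml_path, 'wb') as f:
            f.write(XML)

        # Hand-built index (assumption): key -> list of (offset, length).
        joe = [b'<book author="Joe"><title>A</title></book>',
               b'<book author="Joe"><title>C</title></book>']
        with shelve.open(index_path) as index:
            index['author:Joe'] = [(XML.index(frag), len(frag)) for frag in joe]

        docs = load_fragments(xml_path, index_path, 'author:Joe')
        return [d.getElementsByTagName('title')[0].firstChild.data
                for d in docs]


if __name__ == '__main__':
    print(demo())
```

Only the indexed spans are read from disk; the rest of the file is never loaded.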

The part that I have not figured out yet is how to quickly parse the XML and create such an index.

So what I need is a fast SAX parser that lets me find the start offset of each tag in the file along with its start event. Then I can parse a subsection of the XML, knowing its starting point in the document, extract the key information, and store the key and offset in the shelve index.


Solution

  • Since locators return line and column numbers rather than byte offsets, you need a small wrapper around the stream that tracks line ends. A simplified example (it could still have some off-by-one errors in edge cases ;-)):

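    # Python 3 version; the input is read as bytes so offsets are byte offsets.
    import io
    import re
    from xml import sax
    from xml.sax import handler
    
    relinend = re.compile(rb'\n')
    
    txt = b'''<foo>
                <tit>Bar</tit>
            <baz>whatever</baz>
         </foo>'''
    stm = io.BytesIO(txt)
    
    class LocatingWrapper:
        """File-like wrapper that records the offset at which each line starts."""
        def __init__(self, f):
            self.f = f
            self.linelocs = [0]  # line 1 starts at offset 0
            self.curoffs = 0     # number of bytes handed to the parser so far
    
        def read(self, *a):
            data = self.f.read(*a)
            # a new line starts right after each newline byte
            linestarts = (m.end() for m in relinend.finditer(data))
            self.linelocs.extend(x + self.curoffs for x in linestarts)
            self.curoffs += len(data)
            return data
    
        def close(self):
            # the SAX driver closes its input stream when parsing finishes
            self.f.close()
    
        def where(self, loc):
            # line numbers are 1-based, column numbers are 0-based; columns
            # may not equal bytes for multi-byte text, so this assumes ASCII
            return self.linelocs[loc.getLineNumber() - 1] + loc.getColumnNumber()
    
    locstm = LocatingWrapper(stm)
    
    class Handler(handler.ContentHandler):
        def setDocumentLocator(self, loc):
            self.loc = loc
        def startElement(self, name, attrs):
            # prints name@line:column (offset of the tag's '<')
            print('%s@%s:%s (%s)' % (name,
                                     self.loc.getLineNumber(),
                                     self.loc.getColumnNumber(),
                                     locstm.where(self.loc)))
    
    sax.parse(locstm, Handler())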
    

    Of course you don't need to keep all of the linelocs around -- to save memory, you can drop "old" ones (below the latest one queried) but then you need to make linelocs a dict, etc.
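The dict-based variant mentioned above could be sketched like this (Python 3). The names `PruningLocatingWrapper`, `OffsetCollector` and `element_offsets` are made up for illustration. It assumes the parser reports events in document order and that the input is ASCII, so that column numbers equal byte counts.

```python
import io
import re
from xml import sax
from xml.sax import handler

NEWLINE = re.compile(rb'\n')


class PruningLocatingWrapper:
    """Tracks line-start offsets in a dict and forgets lines already passed."""

    def __init__(self, f):
        self.f = f
        self.linestarts = {1: 0}  # line number -> byte offset of line start
        self.lastline = 1         # highest line number seen so far
        self.oldest = 1           # lowest line number still stored
        self.curoffs = 0          # bytes handed to the parser so far

    def read(self, *a):
        data = self.f.read(*a)
        for m in NEWLINE.finditer(data):
            self.lastline += 1
            self.linestarts[self.lastline] = self.curoffs + m.end()
        self.curoffs += len(data)
        return data

    def close(self):
        # the SAX driver closes its input stream when parsing finishes
        self.f.close()

    def where(self, loc):
        line = loc.getLineNumber()
        # drop entries for lines below the one being queried
        while self.oldest < line:
            del self.linestarts[self.oldest]
            self.oldest += 1
        return self.linestarts[line] + loc.getColumnNumber()


class OffsetCollector(handler.ContentHandler):
    """Collects (element name, byte offset of its start tag) pairs."""

    def __init__(self, wrapper):
        super().__init__()
        self.wrapper = wrapper
        self.offsets = []

    def setDocumentLocator(self, loc):
        self.loc = loc

    def startElement(self, name, attrs):
        self.offsets.append((name, self.wrapper.where(self.loc)))


def element_offsets(data):
    wrapper = PruningLocatingWrapper(io.BytesIO(data))
    collector = OffsetCollector(wrapper)
    sax.parse(wrapper, collector)
    return collector.offsets, wrapper.linestarts
```

Because the parser reads its input in chunks, every line of a chunk is recorded before the events for that chunk arrive, so the dict holds roughly one chunk's worth of lines at a time.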