python-2.7 lxml iterparse

Pass a string instead of a file to the lxml iterparse function using Python 2.7


I am iterating over an XML tree using the lxml.etree function iterparse().

This works fine with an input file:

from lxml import etree as ET

xml_source = "formatted_html_diff.xml"
context = ET.iterparse(xml_source, events=("start",))
event, root = context.next()  # Python 2 iterator method

However, I would like to pass a string containing the same content as the file instead.

I tried using:

context = ET.iterparse(StringIO(result), events=("start",))

But this causes the following error:

Traceback (most recent call last):
  File "c:/Users/pag/Documents/12_raw_handle/remove_from_xhtmlv02.py", line 96, in <module>
    event, root = context.next()
  File "src\lxml\iterparse.pxi", line 209, in lxml.etree.iterparse.__next__
TypeError: reading file objects must return bytes objects

Does anyone know how I can solve this error?

Thanks in advance.


Solution

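  • Use BytesIO instead of StringIO. As the error message says, iterparse requires the file object to return bytes, and a StringIO holding text returns text instead. The following code works with both Python 2.7 and Python 3: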

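    from lxml import etree
    from io import BytesIO

    xml = """
    <root>
     <a/>
     <b/>
    </root>"""

    # Encode the string and wrap the bytes in a file-like object.
    context = etree.iterparse(BytesIO(xml.encode("UTF-8")), events=("start",))

    # Each call to next() yields one (event, element) pair.
    print(next(context))
    print(next(context))
    print(next(context))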
    

    Output:

    ('start', <Element root at 0x315dc10>)
    ('start', <Element a at 0x315dbc0>)
    ('start', <Element b at 0x315db98>)
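
If the XML sometimes arrives as text and sometimes as bytes, the same idea can be wrapped in a small helper. This is only a sketch: the names as_byte_stream and start_tags are made up for illustration and are not part of lxml, and it assumes the text should be encoded as UTF-8 (if the document carries an XML declaration naming a different encoding, encode with that one instead).

```python
from io import BytesIO


def as_byte_stream(xml_text, encoding="utf-8"):
    """Wrap in-memory XML in a file-like object whose read() returns bytes.

    Hypothetical helper: accepts either text or already-encoded bytes.
    On Python 2.7, a plain str is already bytes and is passed through.
    """
    if isinstance(xml_text, bytes):
        data = xml_text
    else:
        data = xml_text.encode(encoding)
    return BytesIO(data)


def start_tags(xml_text):
    """Return the tag names in document order using lxml's iterparse."""
    # Imported here so as_byte_stream itself has no lxml dependency.
    from lxml import etree

    context = etree.iterparse(as_byte_stream(xml_text), events=("start",))
    return [element.tag for _event, element in context]
```

For example, start_tags(u"<root><a/><b/></root>") returns ['root', 'a', 'b'], the same order as the "start" events in the output above.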