
Error parsing xml from file but not as string? Python


I am trying to use xmltodict to parse a large number of XML files so that I can turn them into dataframes. However, when I try to parse the actual XML files, I get the error:

"ExpatError: not well-formed (invalid token): line 1, column 5"

The error is identical for every XML file, down to "line 1, column 5", even though the files differ considerably in length (they all share the same structure).

When I copy the contents of an XML file into a Python string, parsing with xmltodict works perfectly. For example:

xmlstr ="""<?xml version="1.0" encoding="utf-8"?>
<document id="DDI-DrugBank.d200">
    <sentence id="DDI-DrugBank.d200.s0" text="Co-administration of probenecid with acyclovir has been shown to increase the mean half-life and the area under the concentration-time curve.">
        <entity id="DDI-DrugBank.d200.s0.e0" charOffset="21-30"
            type="drug" text="probenecid"/>
        <entity id="DDI-DrugBank.d200.s0.e1" charOffset="37-45"
            type="drug" text="acyclovir"/>
        <pair id="DDI-DrugBank.d200.s0.p0" e1="DDI-DrugBank.d200.s0.e0"
            e2="DDI-DrugBank.d200.s0.e1" ddi="true" type="mechanism"/>
    </sentence>
    <sentence id="DDI-DrugBank.d200.s1" text="Urinary excretion and renal clearance were correspondingly reduced."/>
    <sentence id="DDI-DrugBank.d200.s2" text="The clinical effects of this combination have not been studied."/>
</document>"""

import xmltodict as x2d

nestdict1 = x2d.parse('Train/DrugBank/Aciclovir_ddi.xml')

nestdict2 = x2d.parse(xmlstr)

In the above example, nestdict1 throws the error while nestdict2 parses fine, even though xmlstr is a direct copy and paste from the file 'Train/DrugBank/Aciclovir_ddi.xml'.


Solution

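  • You need to pass a file object, not a string containing the filename. When parse receives a string, it treats that string as the XML text itself, so it is trying to parse the path 'Train/DrugBank/Aciclovir_ddi.xml' as a document. That is why every file fails at the same position on line 1, regardless of its contents.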

    From the docs:

    In [4]: print(xmltodict.parse.__doc__)
    Parse the given XML input and convert it into a dictionary.
    
        `xml_input` can either be a `string` or a file-like object.
    

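    So, open the file to get a file object (binary mode is the safer choice, since it leaves decoding to the parser and the file's encoding declaration):

    fd = open("Train/DrugBank/Aciclovir_ddi.xml", "rb")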
    

    And then pass it to the parse method:

    x2d.parse(fd)
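
    The difference between the two kinds of input can be reproduced with a small sketch. It uses the standard library's expat parser directly (the ExpatError in the question comes from that parser) so that it runs without third-party packages. The helper names parse_string and parse_file, the file name sample_ddi.xml, and the shortened XML content are all made up for illustration.

```python
import os
import tempfile
from xml.parsers import expat

# Shortened stand-in for one of the files in the question (hypothetical content).
XML_TEXT = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<document id="d200"><sentence id="s0" text="example"/></document>'
)


def parse_string(data):
    """Treat `data` as XML text and return the element names encountered."""
    names = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(data, True)
    return names


def parse_file(path):
    """Open `path` in binary mode and let the parser read from the file object."""
    names = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    with open(path, "rb") as fh:
        parser.ParseFile(fh)
    return names


with tempfile.TemporaryDirectory() as tmp_dir:
    xml_path = os.path.join(tmp_dir, "sample_ddi.xml")
    with open(xml_path, "w", encoding="utf-8") as fh:
        fh.write(XML_TEXT)

    # Passing the path as a string: the parser sees the path, not the file.
    try:
        parse_string(xml_path)
        path_error = None
    except expat.ExpatError as err:
        path_error = err

    # Passing the XML text or an open file object both work.
    names_from_string = parse_string(XML_TEXT)
    names_from_file = parse_file(xml_path)

print("path as string failed:", path_error)
print("elements from file:", names_from_file)
```

    With xmltodict the same pattern would be to open the file in a with block and call x2d.parse on the file object, which also closes the file once parsing is done.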