Parsing an xml file with an ordered dictionary


I have an xml file of the form:

<NewDataSet>
    <Root>
        <Phonemic>and</Phonemic>
        <Phonetic>nd</Phonetic>
        <Description/>
        <Start>0</Start>
        <End>8262</End>
    </Root>
    <Root>
        <Phonemic>comfortable</Phonemic>
        <Phonetic>comfetebl</Phonetic>
        <Description>adj</Description>
        <Start>61404</Start>
        <End>72624</End>
    </Root>
</NewDataSet>

I need to process it so that, for instance, when the user inputs "nd", the program matches it against the <Phonetic> tag and returns "and" from the corresponding <Phonemic> tag. I thought that if I converted the xml file to a dictionary, I could iterate over the data and look up information when needed.

I searched and found xmltodict, which is meant for exactly this purpose:

import xmltodict
with open(r'path\to\1.xml', encoding='utf-8', errors='ignore') as fd:
    obj = xmltodict.parse(fd.read())

Running this gives me an ordered dict:

>>> obj
OrderedDict([('NewDataSet', OrderedDict([('Root', [OrderedDict([('Phonemic', 'and'), ('Phonetic', 'nd'), ('Description', None), ('Start', '0'), ('End', '8262')]), OrderedDict([('Phonemic', 'comfortable'), ('Phonetic', 'comfetebl'), ('Description', 'adj'), ('Start', '61404'), ('End', '72624')])])]))])

Unfortunately this hasn't made things simpler, and I am not sure how to implement the program with the new data structure. For example, to access nd I'd have to write:

obj['NewDataSet']['Root'][0]['Phonetic']

which is ridiculously complicated. I tried converting it to a regular dictionary with dict(), but because the structure is nested, the inner layers remain OrderedDicts, and my data is very large.


Solution

  • If you are accessing this as obj['NewDataSet']['Root'][0]['Phonetic'], IMO, you are making it harder than it needs to be.

    Instead, you can do the following:

    obj = obj["NewDataSet"]
    # xmltodict yields a list when <Root> repeats and a single dict when it occurs once,
    # so wrap the single case to make root_elements always a list
    root = obj["Root"]
    root_elements = root if isinstance(root, list) else [root]
    for element in root_elements:
        print(element["Phonetic"])
    

    Even though this code looks longer, the advantage is that it stays compact and modular once you start dealing with sufficiently large xml.

    PS: I had the same issues with xmltodict. But compared with parsing the xml files using xml.etree.ElementTree, xmltodict was much easier to work with: the code base was smaller, and I didn't have to deal with the other inanities of the xml module.

    EDIT

    The following code works for me:

    import xmltodict
    from collections import OrderedDict
    
    xmldata = """<NewDataSet>
        <Root>
            <Phonemic>and</Phonemic>
            <Phonetic>nd</Phonetic>
            <Description/>
            <Start>0</Start>
            <End>8262</End>
        </Root>
        <Root>
            <Phonemic>comfortable</Phonemic>
            <Phonetic>comfetebl</Phonetic>
            <Description>adj</Description>
            <Start>61404</Start>
            <End>72624</End>
        </Root>
    </NewDataSet>"""
    
    obj = xmltodict.parse(xmldata)
    obj = obj["NewDataSet"]
    # xmltodict yields a list when <Root> repeats and a single dict when it occurs once,
    # so wrap the single case to make root_elements always a list
    root = obj["Root"]
    root_elements = root if isinstance(root, list) else [root]
    for element in root_elements:
        print(element["Phonetic"])
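
To get from this loop to the lookup the question asks for (input nd, get and back), one option is to build a dictionary keyed on the Phonetic value once and query it afterwards. The following is a minimal sketch and not part of the code above: build_phonetic_index is a hypothetical helper name, and the plain dicts stand in for the records xmltodict returns, so the example runs without the xml file. Values are kept in a list on the assumption that two words may share one phonetic spelling.

```python
def build_phonetic_index(root_elements):
    """Map each Phonetic value to the list of Phonemic values that share it."""
    index = {}
    for element in root_elements:
        # setdefault creates the list the first time a Phonetic value is seen
        index.setdefault(element["Phonetic"], []).append(element["Phonemic"])
    return index


# Plain dicts standing in for the parsed <Root> records from the question
# (assumption: the real records come from xmltodict and behave like mappings).
root_elements = [
    {"Phonemic": "and", "Phonetic": "nd", "Description": None,
     "Start": "0", "End": "8262"},
    {"Phonemic": "comfortable", "Phonetic": "comfetebl", "Description": "adj",
     "Start": "61404", "End": "72624"},
]

index = build_phonetic_index(root_elements)
print(index.get("nd", []))  # prints ['and']
```

With real data, pass the root_elements list produced by the parsing code instead of the sample records. Building the index costs one pass over the data, after which each lookup is a single dictionary access rather than a scan of every record.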