
Can I flatten a deeply nested Python dictionary which contains values with lists of more nested dictionaries?


I am working with a large XML file from which I have been trying to extract keys and values. The information in this file is very sensitive, so I cannot share it. I started by using the xml library. However, after hours of frustration, I discovered the xmltodict library. I used this library to convert my XML to a dictionary (something I am much more familiar with than XML).

import xmltodict

# convert xml to dictionary
dict_nested = xmltodict.parse(str_xml)

Now that the XML is a dictionary, I would like to flatten it, because it has many levels of nesting (I don't know how many), while creating key names that help me trace the path to each value. So I tried:

from flatten_dict import flatten

# flatten dict_nested 
dict_flat = flatten(dict_nested)

The result may look something like this but with many more layers:

{'ID': '123',
 'info': [{'breed':'collie'}, 
          {'fur': [{'short':'no'}, 
                   {'color':[{'black':'no'},
                             {'brown':'yes'}]}]}]}

This worked well, as my keys are tuples showing the path of layers. My values are either strings (i.e., the end result I am looking for) or lists of type OrderedDict.

Since each dictionary in each list needs to be flattened, and I don't know how deep the nesting goes, I am trying to figure out a way to programmatically flatten all dictionaries until every key corresponds to a single value (i.e., not a list or dictionary).

Ideally, the output would look something like this:

{'ID':'123',
 'info_breed':'collie',
 'info_fur_short':'no',
 'info_fur_color_black':'no',
 'info_fur_color_brown':'yes'}

Sorry that I cannot share more of my output because of the sensitive information.


Solution

  • You can use a recursive approach, taking into consideration that your dict's values are either strings or lists of other dicts:

    dict_flat = {'ID': '123',
     'info': [{'breed':'collie'}, 
              {'fur': [{'short':'no'}, 
                       {'color':[{'black':'no'},
                                 {'brown':'yes'}]}]}]}
    
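    def my_flatten(dict_flat, key_prefix=None):
        """Flatten a dict whose values are strings or lists of dicts."""
        result = {}
        for k, v in dict_flat.items():
            # Build the path-style key, e.g. 'info_fur_color'
            key = f'{key_prefix}_{k}' if key_prefix is not None else k
            if isinstance(v, list):
                # Each list item is assumed to be a dict; recurse into it
                for d in v:
                    result.update(my_flatten(d, key))
            else:
                # Leaf value: store it under the accumulated key
                result[key] = v
        return result
    
    print(my_flatten(dict_flat))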
    

    output:

    {'ID': '123',
     'info_breed': 'collie',
     'info_fur_short': 'no',
     'info_fur_color_black': 'no',
     'info_fur_color_brown': 'yes'}
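
The function above relies on every value being either a string or a list of dicts. Output from xmltodict may also contain dicts nested directly inside dicts, or lists of plain strings, so a slightly more general variant might be useful. The following is only a sketch under those assumptions: the name `flatten_nested`, the `sep` parameter, and the choice to suffix scalar list items with their index are all hypothetical, not part of the original answer.

```python
from collections.abc import Mapping


def flatten_nested(value, key_prefix=None, sep="_"):
    """Flatten nested dicts and lists into a single-level dict.

    Hypothetical generalisation of my_flatten: it also accepts dicts
    nested directly in dicts and lists that hold scalar values.
    """

    def join(part):
        # Append one path component to the accumulated key.
        return f"{key_prefix}{sep}{part}" if key_prefix is not None else str(part)

    result = {}
    if isinstance(value, Mapping):
        # Covers both dict and OrderedDict.
        for k, v in value.items():
            result.update(flatten_nested(v, join(k), sep))
    elif isinstance(value, list):
        for i, item in enumerate(value):
            if isinstance(item, (Mapping, list)):
                # Containers keep the parent key, as in my_flatten.
                result.update(flatten_nested(item, key_prefix, sep))
            else:
                # Assumption: scalars in a list get an index suffix
                # so they do not overwrite each other.
                result[join(i)] = item
    else:
        # Leaf value: store it under the accumulated key.
        result[key_prefix] = value
    return result


nested = {
    "ID": "123",
    "info": {
        "breed": "collie",
        "fur": [{"short": "no"},
                {"color": [{"black": "no"}, {"brown": "yes"}]}],
        "tags": ["a", "b"],
    },
}
flat = flatten_nested(nested)
print(flat)
```

One caveat: if a list holds several dicts that share the same keys (as repeated XML elements typically do), later items overwrite earlier ones, exactly as in the original function. Adding the list index to the key for dict items as well would avoid that, at the cost of longer key names.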