I am working with a large XML file from which I have been trying to extract keys and values. The information in this file is very sensitive, so I cannot share it. I started by using the xml library. However, after hours of frustration I discovered the xmltodict library. I used this library to convert my XML to a dictionary (something I am much more familiar with than XML).
import xmltodict
# convert xml to dictionary
dict_nested = xmltodict.parse(str_xml)
Now that the XML is a dictionary, I would like to flatten it, because it has many levels (I don't know how many), while creating key names that let me trace the path to each value. So I tried:
from flatten_dict import flatten
# flatten dict_nested
dict_flat = flatten(dict_nested)
The result may look something like this but with many more layers:
{'ID': '123',
 'info': [{'breed': 'collie'},
          {'fur': [{'short': 'no'},
                   {'color': [{'black': 'no'},
                              {'brown': 'yes'}]}]}]}
This worked well, as my keys are tuples showing the path of layers. My values are either strings (i.e., the end result I am looking for) or lists of OrderedDict objects.
Since each dictionary in each list needs to be flattened and I don't know how deep the nesting goes, I am trying to find a way to programmatically flatten all dictionaries until every key corresponds to a single value (i.e., not a list or dictionary).
Ideally, the output would look something like this:
{'ID':'123',
'info_breed':'collie',
'info_fur_short':'no',
'info_fur_color_black':'no',
'info_fur_color_brown':'yes'}
Sorry that I cannot share more of my output because of the sensitive information.
You can use a recursive approach, taking into consideration that your dict's values are either strings or lists of other dicts:
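dict_flat = {'ID': '123',
             'info': [{'breed': 'collie'},
                      {'fur': [{'short': 'no'},
                               {'color': [{'black': 'no'},
                                          {'brown': 'yes'}]}]}]}


def my_flatten(dict_flat, key_prefix=None):
    result = {}
    for k, v in dict_flat.items():
        # Build the path-style key: parent prefix, underscore, current key
        key = f'{key_prefix}_{k}' if key_prefix is not None else k
        if isinstance(v, list):
            # Each list item is assumed to be a dict; flatten it under this key
            for d in v:
                result.update(my_flatten(d, key))
        else:
            # Leaf value (a string here): store it under the accumulated key
            result[key] = v
    return result


my_flatten(dict_flat)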
Output:
{'ID': '123',
'info_breed': 'collie',
'info_fur_short': 'no',
'info_fur_color_black': 'no',
'info_fur_color_brown': 'yes'}
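The function above assumes every value is either a string or a list of dicts. Output from xmltodict can also contain dicts nested directly inside dicts, so here is a minimal sketch of a variant that recurses into both. The name flatten_nested and the sep parameter are my own, not part of any library, and this is only one possible way to generalize the idea:

```python
from collections.abc import Mapping


def flatten_nested(obj, key_prefix=None, sep="_"):
    """Flatten nested mappings and lists into a single-level dict.

    Hypothetical helper: keys are joined with `sep` to record the path.
    """
    result = {}
    if isinstance(obj, Mapping):
        # Covers dict and OrderedDict alike
        for k, v in obj.items():
            key = f"{key_prefix}{sep}{k}" if key_prefix is not None else str(k)
            result.update(flatten_nested(v, key, sep))
    elif isinstance(obj, list):
        # List items share the parent's prefix, as in my_flatten
        for item in obj:
            result.update(flatten_nested(item, key_prefix, sep))
    else:
        # Leaf value (string, number, None): store it under the accumulated path
        result[key_prefix] = obj
    return result


# Dicts nested directly inside dicts, mixed with a list of dicts
nested = {'ID': '123',
          'info': {'breed': 'collie',
                   'fur': [{'short': 'no'},
                           {'color': {'black': 'no', 'brown': 'yes'}}]}}
flat = flatten_nested(nested)
```

One caveat: when two paths produce the same flattened key (for example, a repeated XML element that becomes a list of similar dicts), later values overwrite earlier ones. If that can happen in your data, you would need to add something like a list index to the key.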