Search code examples
pythonrecursiondictionarytraversal

Find all occurrences of a key in nested dictionaries and lists


I have a dictionary like this:

{
    "id": "abcde",
    "key1": "blah",
    "key2": "blah blah",
    "nestedlist": [
        {
            "id": "qwerty",
            "nestednestedlist": [
                {
                    "id": "xyz",
                    "keyA": "blah blah blah"
                },
                {
                    "id": "fghi",
                    "keyZ": "blah blah blah"
                }
            ],
            "anothernestednestedlist": [
                {
                    "id": "asdf",
                    "keyQ": "blah blah"
                },
                {
                    "id": "yuiop",
                    "keyW": "blah"
                }
            ]
        }
    ]
}

Basically a dictionary with nested lists, dictionaries, and strings, of arbitrary depth.

What is the best way of traversing this to extract the values of every "id" key? I want to achieve the equivalent of an XPath query like "//id". The value of "id" is always a string.

So from my example, the output I need is basically:

["abcde", "qwerty", "xyz", "fghi", "asdf", "yuiop"]

Order is not important.


Solution

  • I found this Q/A very interesting, since it provides several different solutions for the same problem. I took all these functions and tested them with a complex dictionary object. I had to take two functions out of the test, because they had to many fail results and they did not support returning lists or dicts as values, which i find essential, since a function should be prepared for almost any data to come.

    So i pumped the other functions in 100.000 iterations through the timeit module and output came to following result:

    0.11 usec/pass on gen_dict_extract(k,o)
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    6.03 usec/pass on find_all_items(k,o)
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    0.15 usec/pass on findkeys(k,o)
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    1.79 usec/pass on get_recursively(k,o)
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    0.14 usec/pass on find(k,o)
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    0.36 usec/pass on dict_extract(k,o)
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    

    All functions had the same needle to search for ('logging') and the same dictionary object, which is constructed like this:

    o = { 'temparature': '50', 
          'logging': {
            'handlers': {
              'console': {
                'formatter': 'simple', 
                'class': 'logging.StreamHandler', 
                'stream': 'ext://sys.stdout', 
                'level': 'DEBUG'
              }
            },
            'loggers': {
              'simpleExample': {
                'handlers': ['console'], 
                'propagate': 'no', 
                'level': 'INFO'
              },
             'root': {
               'handlers': ['console'], 
               'level': 'DEBUG'
             }
           }, 
           'version': '1', 
           'formatters': {
             'simple': {
               'datefmt': "'%Y-%m-%d %H:%M:%S'", 
               'format': '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
             }
           }
         }, 
         'treatment': {'second': 5, 'last': 4, 'first': 4},   
         'treatment_plan': [[4, 5, 4], [4, 5, 4], [5, 5, 5]]
    }
    

    All functions delivered the same result, but the time differences are dramatic! The function gen_dict_extract(k,o) is my function adapted from the functions here, actually it is pretty much like the find function from Alfe, with the main difference, that i am checking if the given object has iteritems function, in case strings are passed during recursion:

    # python 2
    def gen_dict_extract(key, var):
        if hasattr(var,'iteritems'): # hasattr(var,'items') for python 3
            for k, v in var.iteritems(): # var.items() for python 3
                if k == key:
                    yield v
                if isinstance(v, dict):
                    for result in gen_dict_extract(key, v):
                        yield result
                elif isinstance(v, list):
                    for d in v:
                        for result in gen_dict_extract(key, d):
                            yield result
    

    So this variant is the fastest and safest of the functions here. And find_all_items is incredibly slow and far off the second slowest get_recursivley while the rest, except dict_extract, is close to each other. The functions fun and keyHole only work if you are looking for strings.

    Interesting learning aspect here :)