Search code examples
pythonpathtreebacktracking

How to efficiently visit multiple paths of a nested dictionary with nested lists


Say I have a nested dictionary like so:

{
    'a': {
        'z': [
            {
                'x': 1,
                'y': [
                    [{'a': 10}, {'f': 21}],
                    {'m': 100},
                ]
            }
        ]
    }
    'm': {
        'x': 100
    }
}

And let's say I have a list of paths that have no relative order to each other:

[('a', 'z', 'x'), ('m', 'x'), ('a', 'z', 'y', 'a'), ...]

And I want to update whatever value the final key is for all these paths to some value val. Notice that in the third tuple, the last key is a, but it sits in a list.

In my data, these assumptions can be made:

  • key names can be repeated in a single path (notice 'a' is the source and destination node in the third tuple)
  • destination "nodes" are always a string, int, float, boolean, i.e. - never ending on a list or another dictionary
  • lists only hold other lists or dictionaries

My thoughts:

  1. My current implementation is a brute force approach which is too slow for my requirements: iterate from source to destination for every single one of these paths, detecting if I am on a list or dictionary.
  2. Ideally, I would back track and visit nearby nodes that need to be visited as well, so that I don't have to start from the top level again. This kind of reminds me of visiting multiple stops/cities on a trip.
  3. I was also thinking of a trie, but I think that would still mean starting from source to destination. Also I don't think a trie would work since we have lists.

Any thoughts?


Solution

  • You can cache previously-visited nodes in a dictionary and store their references.

    # Helper function to flatten nested lists.
    def flatten(xs):
        for x in xs:
            if isinstance(x, list):
                yield from flatten(x)
            else:
                yield x
    
    
    # The cache stores all already-visited paths.
    cache = {}
    
    
    def setValue(nested_dict, path, value):
        queue = nested_dict
        depth = 0
    
        # Find the largest subpath that's already in the cache (=already visited),
        # starting with the full path and removing one key at a time from the end.
        for end_index in range(len(path), 0, -1):
            if path[0:end_index] in cache:
                queue = cache[path[0:end_index]]
                depth = len(path[0:end_index])
                break
    
        # For the rest of the keys in the path (unvisited), visit them and
        # store their references in the cache.
        while depth < len(path) - 1:
            if type(queue) is dict:
                queue = queue[path[depth]]
            elif type(queue) is list:
                queue = [i for i in flatten(queue) if path[depth] in i][0][path[depth]]
            else:
                raise TypeError(f"Unknown data type at {path}[{depth}]")
    
            cache[path[0 : depth + 1]] = queue
            depth += 1
    
        # For the final key, change its dictionary value to 'value'.
        if type(queue) is dict:
            queue[path[depth]] = value
        elif type(queue) is list:
            [i for i in flatten(queue) if path[depth] in i][0][path[depth]] = value
        else:
            raise TypeError(f"Unknown data type at {path}[{depth}]")
    
    
    data = {
        'a': {
            'z': [
                {
                    'x': 1,
                    'y': [
                        [{'a': 10}, {'f': 21}],
                        {'m': 100},
                    ]
                }
            ]
        },
        'm': {
            'x': 100
        }
    }
    
    setValue(data, ("a", "z", "x"), 900)
    setValue(data, ("a", "z", "y", "a"), 901)
    setValue(data, ("a", "z", "y", "f"), 902)
    setValue(data, ("a", "z", "y", "m"), 903)
    setValue(data, ("m", "x"), 904)
    
    print(data)
    

    The nested lists make traversal slightly tricky, but they can just be flattened. Looking up an item by value in a list will obviously be slower than looking up a key in a dictionary because you lose the hash table, but that's mitigated in this case by only having to do it once per node.