
Chaining of Python generators to traverse a nested data structure; best practices


Suppose I have a nested data structure that I want to traverse. This data structure contains nodes which in turn may offer their children via node.get_children_generator(). Of course, these children are also of type node and are evaluated lazily, i.e. enumerated by a generator. For simplicity, let us assume that if a node has no children, get_children_generator simply returns an empty list/generator (so we don't have to check for emptiness manually).

In order to traverse this data structure of nested nodes, is it a good idea to simply chain all generators iteratively? That is creating chains of chains of chains and so forth? Or would this create too much overhead?

What I have in mind is something like the following:

import itertools as it

def traverse_nodes(start_node):
    """Traverses nodes in breadth first manner.
    
    First returns the start node.
    
    For simplicity we require that 
    there are no cycles in the data structure,
    i.e. we are dealing with a simple tree.
    
    """
    node_queue = iter([start_node])
    while True:
        try:
            next_node = next(node_queue)  # built-in next() (Python 3)
            yield next_node
            
            # Next get the children
            child_gen = next_node.get_children_generator()
            
            # The next code line is the one I am worried about
            # is it a good idea to make a chain of chains?
            node_queue = it.chain(node_queue, child_gen)
        except StopIteration:
            # There are no more nodes
            break

Is the line node_queue = it.chain(node_queue, child_gen) a good approach to the traversal? Is it a good idea to make a chain of chains of chains, etc.?

So that you have something you can actually execute, here is a rather silly dummy node class. The generator is a bit useless here, but assume that in a real-world example evaluating the children is costly and really requires a generator.

class Node(object):
    """Rather silly example of a nested node.

    The children are actually stored in a list,
    so the generator is actually not needed.
    
    But simply assume that producing the children
    requires lazy evaluation.
    
    """
    counter = 0 # Counter for node identification
    
    def __init__(self):
        self.children = [] # children list
        self.node_number = Node.counter # identifies the node
        Node.counter += 1
    
    def __repr__(self):
        return 'I am node #%d' % self.node_number
    
    def get_children_generator(self):
        """Returns a generator over children"""
        return (x for x in self.children)

So the following code snippet

node0 = Node()
node1 = Node()
node2 = Node()
node3 = Node()
node4 = Node()
node5 = Node()
node6 = Node()
node0.children = [node1, node2]
node1.children = [node6]
node2.children = [node3, node5]
node3.children = [node4]

for node in traverse_nodes(node0):
    print(node)

prints

I am node #0
I am node #1
I am node #2
I am node #6
I am node #3
I am node #5
I am node #4


Solution

  • Chaining multiple chains results in recursive function-call overhead proportional to the number of chains chained together.

    First of all, here is a pure-Python chain implementation, so that we don't lose stack info. The C implementation is here, and you can see it does basically the same thing: it calls next() on the underlying iterable.

    from inspect import stack
    
    
    def chain(it1, it2):
        # Consume it1 fully, then it2 (a two-iterable version of itertools.chain).
        # Note: the inner for loop already absorbs each iterable's StopIteration,
        # so no explicit try/except is needed.
        for collection in (it1, it2):
            for el in collection:
                yield el
    

    We only care about a two-iterable version of chain: it first consumes the first iterable, then the second.
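    As a quick sanity check (my own snippet, using the real itertools.chain rather than the toy version above), nesting chains the way the traversal does preserves order: each wrapper first drains its left argument, then its right one:

    ```python
    import itertools as it

    # Build a chain of chains, exactly like the traversal does:
    # each new level wraps the previous queue plus a fresh child list.
    queue = iter([0])
    queue = it.chain(queue, [1, 2])   # after visiting node 0
    queue = it.chain(queue, [3])      # after visiting node 1
    queue = it.chain(queue, [4, 5])   # after visiting node 2

    result = list(queue)
    print(result)  # -> [0, 1, 2, 3, 4, 5]
    ```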

    class VerboseListIterator(object):
        def __init__(self, collection, node):
            self.collection = collection
            self.node = node
            self.idx = 0
    
        def __iter__(self):
            return self
    
        def __next__(self):
            print('Printing {}th child of "{}". Stack size: {}'.format(self.idx, self.node, len(stack())))
            if self.idx >= len(self.collection):
                raise StopIteration()
    
            self.idx += 1
            return self.collection[self.idx - 1]
    

    This is our handy list iterator that tells us how many stack frames there are when the next element of the wrapped list is returned.

    class Node(object):
        """Rather silly example of a nested node.
    
        The children are actually stored in a list,
        so the generator is actually not needed.
    
        But simply assume that producing the children
        requires lazy evaluation.
    
        """
        counter = 0 # Counter for node identification
    
        def __init__(self):
            self.children = [] # children list
            self.node_number = Node.counter # identifies the node
            Node.counter += 1
    
        def __repr__(self):
            return 'I am node #%d' % self.node_number
    
        def get_children_generator(self):
            """Returns a generator over children"""
            return VerboseListIterator(self.children, self)
    
    
    def traverse_nodes(start_node):
        """Traverses nodes in breadth first manner.
    
        First returns the start node.
    
        For simplicity we require that
        there are no cycles in the data structure,
        i.e. we are dealing with a simple tree.
    
        """
        node_queue = iter([start_node])
        while True:
            try:
                next_node = next(node_queue)
                yield next_node
    
                # Next get the children
                child_gen = next_node.get_children_generator()
    
                # The next code line is the one I am worried about
                # is it a good idea to make a chain of chains?
                node_queue = chain(node_queue, child_gen)
            except StopIteration:
                # There are no more nodes
                break
    

    These are your implementations, adjusted for the Python version used (3.4).

    nodes = [Node() for _ in range(10)]
    nodes[0].children = nodes[1:6]
    nodes[1].children = [nodes[6]]
    nodes[2].children = [nodes[7]]
    nodes[3].children = [nodes[8]]
    nodes[4].children = [nodes[9]]
    

    Initialization of the node graph. The root is connected to the first 5 nodes; node i (for i = 1..4) is in turn connected to node i + 5.

    for node in traverse_nodes(nodes[0]):
        print(node)
    

    The result of this iteration is as follows:

    I am node #0
    Printing 0th child of "I am node #0". Stack size: 4
    I am node #1
    Printing 1th child of "I am node #0". Stack size: 5
    I am node #2
    Printing 2th child of "I am node #0". Stack size: 6
    I am node #3
    Printing 3th child of "I am node #0". Stack size: 7
    I am node #4
    Printing 4th child of "I am node #0". Stack size: 8
    I am node #5
    Printing 5th child of "I am node #0". Stack size: 9
    Printing 0th child of "I am node #1". Stack size: 8
    I am node #6
    Printing 1th child of "I am node #1". Stack size: 9
    Printing 0th child of "I am node #2". Stack size: 8
    I am node #7
    Printing 1th child of "I am node #2". Stack size: 9
    Printing 0th child of "I am node #3". Stack size: 8
    I am node #8
    Printing 1th child of "I am node #3". Stack size: 9
    Printing 0th child of "I am node #4". Stack size: 8
    I am node #9
    Printing 1th child of "I am node #4". Stack size: 9
    Printing 0th child of "I am node #5". Stack size: 8
    Printing 0th child of "I am node #6". Stack size: 7
    Printing 0th child of "I am node #7". Stack size: 6
    Printing 0th child of "I am node #8". Stack size: 5
    Printing 0th child of "I am node #9". Stack size: 4
    

    As you can see, the closer we get to the end of node0's children list, the bigger the stack becomes. Why is that? Let's take a closer look at each step; each chain call is numbered for clarity:

    1. node_queue = [node0]
    2. Called next(node_queue), yielded node0. node_queue = chain1([node0], [node1, node2, node3, node4, node5]).
    3. Called next(node_queue). The list [node0] is consumed, and the second list is being consumed. node1 gets yielded, and node_queue = chain2(chain1([node0], [node1, ...]), [node6]).
    4. Calling next(node_queue) propagates down to chain1 (from chain2), and node2 is yielded. node_queue = chain3(chain2(chain1([node0], [...]), [node6]), [node7]).
    5. The pattern continues until we are about to yield node5:

      next(chain5(chain4, [node9]))
                    |
                    V
          next(chain4(chain3, [node8]))
                        |
                        V
              next(chain3(chain2, [node7]))
                            |
                            V
                  next(chain2(chain1, [node6]))
                                |
                                V
                      next(chain1([node0], [node1, node2, node3, node4, node5]))
                                                                     ^
                                                                   yield
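      The growth can also be reproduced compactly with a small probe generator (my own illustration, reusing the pure-Python chain from above) that records the live Python stack depth each time an element is produced:

      ```python
      from inspect import stack

      def chain(it1, it2):
          # Same two-iterable pure-Python chain as above.
          for collection in (it1, it2):
              for el in collection:
                  yield el

      depths = []

      def probe(items):
          # Record how many Python frames are live when each element is yielded.
          for x in items:
              depths.append(len(stack()))
              yield x

      queue = probe(['a'])
      queue = chain(queue, probe(['b']))  # one wrapper
      queue = chain(queue, probe(['c']))  # two wrappers
      list(queue)

      # 'a' and 'b' are reached through both chain wrappers, 'c' through
      # only one, so the first two recorded depths are one frame deeper
      # than the last.
      print(depths)
      ```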
      

    Is it faster than simple BFS/DFS?

    Not really. A single next(node_queue) call can trigger many nested calls - roughly proportional to what the iterator queue's size would be in a regular BFS, or, in simple words, the maximum number of children a node in the graph has.
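    For comparison, here is what the conventional breadth-first traversal looks like with an explicit queue (a sketch, not from the original answer; it assumes nodes expose the question's get_children_generator()). Every step does O(1) queue work at a constant stack depth, no matter how many nodes have already been visited:

    ```python
    from collections import deque

    def traverse_nodes_bfs(start_node):
        """Breadth-first traversal with an explicit FIFO queue."""
        queue = deque([start_node])
        while queue:
            node = queue.popleft()  # O(1) at both ends with deque
            yield node
            # Drain the lazily produced children into the queue.
            queue.extend(node.get_children_generator())
    ```

    Using deque rather than a plain list matters here: list.pop(0) is O(n), while deque.popleft() is O(1).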

    tl;dr

    Here's a gif showing the algorithm: http://i.imgur.com/hnPIVG4.gif