Search code examples
python-3.xdictionarygraphdepth-first-searchdirected-graph

Find all cycles with at least 3 nodes in a directed graph using dictionary data structure


Directed Graph

The above graph was drawn using LaTeX: https://www.overleaf.com/read/rxhpghzbkhby

The above graph is represented as a dictionary in Python.

graph = {
    'A' : ['B','D', 'C'],
    'B' : ['C'],
    'C' : [],
    'D' : ['E'],
    'E' : ['G'],
    'F' : ['A', 'I'],
    'G' : ['A', 'K'],
    'H' : ['F', 'G'],
    'I' : ['H'],
    'J' : ['A'],
    'K' : []
}

I have a large graph of about 3,378,546 nodes.

Given the above-directed graph, I am trying to find circles with at least 3 and less than 5 different nodes, and output the first 3 circles.

I spent 1 day and a half on this problem. I looked in Stackoverflow and even tried to follow this Detect Cycle in a Directed Graph tutorial but couldn't come up with a solution.

In this example, the output is a tab-delimited text file where each line has a cycle.

0 A, D, E, G
1 F, I, H

0 and 1 are indexes. Also, there is no order in the alphabet of the graph nodes.

I tried this form How to implement depth-first search in Python tutorial:

visited = set()

def dfs(visited, graph, node):
    if node not in visited:
        print (node)
        visited.add(node)
        for neighbour in graph[node]:
            dfs(visited, graph, neighbour)

dfs(visited, graph, 'A')

But this doesn't help. I also tried this Post


Solution

  • NOTE: This solution is an extended solution to the describe one. I extended to the original graph with ~3million nodes and I look for all cycles that are at least 3 nodes and less than 40 nodes and store the first 3 cycles into a file.


    I came up with the following solution.

    # implementation of Johnson's cycle finding algorithm
    # Original paper: Donald B Johnson. "Finding all the elementary circuits of a directed graph." SIAM Journal on Computing. 1975.
    
    from collections import defaultdict
    
    import networkx as nx
    from networkx.utils import not_implemented_for, pairwise
    
    @not_implemented_for("undirected")
    def findCycles(G):
        """Find simple cycles of a directed graph.
        A `simple cycle` is a closed path where no node appears twice.
        Two elementary circuits are distinct if they are not cyclic permutations of each other.
        This is iterator/generator version of Johnson's algorithm [1]_.
        There may be better algorithms for some cases [2]_ [3]_.
        Parameters
        ----------
        G : NetworkX DiGraph
           A directed graph
        Returns
        -------
        cycle_generator: generator
           A generator that produces elementary cycles of the graph.
           Each cycle is represented by a list of nodes along the cycle.
        Examples
        --------
        >>> graph = {'A' : ['B','D', 'C'],
                     'B' : ['C'],
                     'C' : [],
                     'D' : ['E'],
                     'E' : ['G'],
                     'F' : ['A', 'I'],
                     'G' : ['A', 'K'],
                     'H' : ['F', 'G'],
                     'I' : ['H'],
                     'J' : ['A'],
                     'K' : []
                    }
        >>> G = nx.DiGraph()
        >>> G.add_nodes_from(graph.keys())
        >>> for keys, values in graph.items():
                G.add_edges_from(([(keys, node) for node in values]))
        >>> list(nx.findCycles(G))
        [['F', 'I', 'H'], ['G', 'A', 'D', 'E']]
        Notes
        -----
        The implementation follows pp. 79-80 in [1]_.
        The time complexity is $O((n+e)(c+1))$ for $n$ nodes, $e$ edges and $c$
        elementary circuits.
        References
        ----------
        .. [1] Finding all the elementary circuits of a directed graph.
           D. B. Johnson, SIAM Journal on Computing 4, no. 1, 77-84, 1975.
           https://doi.org/10.1137/0204007
        .. [2] Enumerating the cycles of a digraph: a new preprocessing strategy.
           G. Loizou and P. Thanish, Information Sciences, v. 27, 163-182, 1982.
        .. [3] A search strategy for the elementary cycles of a directed graph.
           J.L. Szwarcfiter and P.E. Lauer, BIT NUMERICAL MATHEMATICS,
           v. 16, no. 2, 192-204, 1976.
        --------
        """
    
        def _unblock(thisnode, blocked, B):
            stack = {thisnode}
            while stack:
                node = stack.pop()
                if node in blocked:
                    blocked.remove(node)
                    stack.update(B[node])
                    B[node].clear()
    
        # Johnson's algorithm requires some ordering of the nodes.
        # We assign the arbitrary ordering given by the strongly connected comps
        # There is no need to track the ordering as each node removed as processed.
        # Also we save the actual graph so we can mutate it. We only take the
        # edges because we do not want to copy edge and node attributes here.
        subG = type(G)(G.edges())
        sccs = [scc for scc in nx.strongly_connected_components(subG) if len(scc) in list(range(3, 41))]
    
        # Johnson's algorithm exclude self cycle edges like (v, v)
        # To be backward compatible, we record those cycles in advance
        # and then remove from subG
        for v in subG:
            if subG.has_edge(v, v):
                yield [v]
                subG.remove_edge(v, v)
    
        while sccs:
            scc = sccs.pop()
            sccG = subG.subgraph(scc)
            # order of scc determines ordering of nodes
            startnode = scc.pop()
            # Processing node runs "circuit" routine from recursive version
            path = [startnode]
            blocked = set()  # vertex: blocked from search?
            closed = set()  # nodes involved in a cycle
            blocked.add(startnode)
            B = defaultdict(set)  # graph portions that yield no elementary circuit
            stack = [(startnode, list(sccG[startnode]))]  # sccG gives comp nbrs
            while stack:
                thisnode, nbrs = stack[-1]
                if nbrs:
                    nextnode = nbrs.pop()
                    if nextnode == startnode:
                        yield path[:]
                        closed.update(path)
                    #                        print "Found a cycle", path, closed
                    elif nextnode not in blocked:
                        path.append(nextnode)
                        stack.append((nextnode, list(sccG[nextnode])))
                        closed.discard(nextnode)
                        blocked.add(nextnode)
                        continue
                # done with nextnode... look for more neighbors
                if not nbrs:  # no more nbrs
                    if thisnode in closed:
                        _unblock(thisnode, blocked, B)
                    else:
                        for nbr in sccG[thisnode]:
                            if thisnode not in B[nbr]:
                                B[nbr].add(thisnode)
                    stack.pop()
                    path.pop()
            # done processing this node
            H = subG.subgraph(scc)  # make smaller to avoid work in SCC routine
            sccs.extend(scc for scc in nx.strongly_connected_components(H) if len(scc) in list(range(3, 41)))
    
    import sys, csv, json
    
    def findAllCycles(jsonInputFile, textOutFile):
        """Find simple cycles of a directed graph (jsonInputFile).
        Parameters:
        ----------
            jsonInputFile: a json file that has all concepts
            textOutFile: give a desired name of output file
        Returns:
        ----------
            a .text file (named: {textOutFile}.txt) has the first 3 cycles found in jsonInputFile
            Each cycle is represented by a list of nodes along the cycle
        """
    
        with open(jsonInputFile) as infile:
            graph = json.load(infile)
        # Convert the json file to a NetworkX directed graph
        G = nx.DiGraph()
        G.add_nodes_from(graph.keys())
        for keys, values in graph.items():
            G.add_edges_from(([(keys, node) for node in values]))
        # Search for all simple cycles existed in the graph
        _cycles = list(findCycles(G))
        # Start with an empty list and populate it by looping over all cycles
        # in _cycles that have at least 3 and less than 40 different concepts (nodes)
        cycles = []
        for cycle in _cycles:
            if len(cycle) in list(range(3, 41)):
                cycles.append(cycle)
        # Store the cycels under constraint in {textOutFile}.txt
        with open(textOutFile, 'w') as outfile:
            for cycle in cycles[:3]:
                outfile.write(','.join(n for n in cycle)+'\n')
            outfile.close()
        # When process finishes, print Done!!
        return 'Done!!'
    
    
    infile = sys.argv[1]
    outfile = sys.argv[2]
    first_cycles = findAllCycles(infile, outfile)
    
    

    To run this program, you simply use a command line as follows:

    >> python3 {program file name}.py graph.json {desired output file name}[.txt][.csv]
    

    let, for example {desired output file name}}.[txt][.csv], be first_3_cycles_found.txt

    In my case, the graph has 3,378,546 nodes which took me ~40min to find all cycles using the above code. Thus, the output file will be:

    enter image description here

    Please contribute to this if you see it needs any improvement or something else to be added.