Search code examples
pythonnetworkxsubgraphconnected-components

Largest strongly connected components of a directed graph


I am working on an Networkx .MultiDiGraph() object built from a total of 82927 directed email data. At current stage, I am trying to get the largest strongly connected components from the .MultiDiGraph() object and its corresponding subgraph. The text data can be accessed here. Here's my working code:

import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
email_df = pd.read_csv('email_network.txt', delimiter = '->')
edge_groups = email_df.groupby(["#Sender", "Recipient"], as_index=False).count().rename(columns={"time":"weight"})
email = nx.from_pandas_dataframe(edge_groups, '#Sender', 'Recipient', edge_attr = 'weight')
G = nx.MultiDiGraph()
G.add_edges_from(email.edges(data=True))

# G is a .MultiDiGraph object
# using .strongly_connected_components() to get the part of G that has the most nodes
# using list comprehension
number_of_nodes = [len(n) for n in sorted(nx.strongly_connected_components(G))]
number_of_nodes

# 'number_of_nodes' return a list of [1, 1, 1,...,1] of length 167 (which is the exact number of nodes in the network)

# using the recommended method in networkx documentation

largest = max(nx.strongly_connected_components(G), key=len)
largest

# 'largest' returns {92}, not sure what this means... 

As I noted in the above code block, the list comprehension method returns a list of [1, 1, 1,..., 1] of length 167 (which is the total number of nodes in my data), while the max(nx.strongly_connected_components(G), key=len) returned {92}, I am not sure what this means.

It looks like there's something wrong with my code and I might have missed several key steps in processing the data. Could anyone care to take a look at and enlighten me on this?

Thank you.

Note: Revised code (kudos to Eric and Joel)

import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
email_df = pd.read_csv('email_network.txt', delimiter = '	')
edge_groups = email_df.groupby(["#Sender", "Recipient"], as_index=False).count().rename(columns={"time":"weight"})
# per @Joel's comment, adding 'create_using = nx.DiGraph()'     
email = nx.from_pandas_dataframe(edge_groups, '#Sender', 'Recipient', edge_attr = 'weight', create_using = nx.DiGraph())
# adding this 'directed' edge list to .MultiDiGraph() object

G = nx.MultiDiGraph()
G.add_edges_from(email.edges(data=True))

We now examine the largest strongly connected component (in terms of the number of nodes) in this network.

In [1]: largest = max(nx.strongly_connected_components(G), key=len)
In [2]: len(largest)
Out [2]: 126

The largest strongly connected component consists of 126 nodes.

[Updates] Upon further trial and error, I found that one needs to use create_using = .MultiDiGraph() (instead of .DiGraph()) when loading data onto networkx, otherwise, even if you get correct number of nodes for your MultiDiGraph and its weakly/strongly connected subgraphs, you might still get the number of edges wrong! This will reflect in you .strongly_connected_subgraphs() outputs.

For my case here, I will recommend others to use this one-liner

import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
G = nx.read_edgelist(path="email_network.txt", data=[('time', int)], create_using=nx.MultiDiGraph(), nodetype=str)

And we can implement .strongly_connected_components(G) and strongly_connected_subgraphs to verify.

If you use the networkx output G from the first code block, max(nx.strongly_connected_components(G), key=len) will give an output with 126 nodes and 52xx something edges, but if you apply the one-liner I listed above, you will get:

In [1]: largest = max(nx.strongly_connected_components(G), key=len)
In [2]: G_sc = max(nx.strongly_connected_subgraphs(G), key=len)

In [3]: nx.number_of_nodes(G_sc) 
Out [3]: 126
In [4]: nx.number_of_nodes(G_sc) 
Out [4]: 82130

You will get the same number of nodes with both methods but different number of edges owing to different counting mechanisms associated with different networkx graph classes.


Solution

  • The underlying cause of your error is that nx.from_pandas_dataframe defaults to creating an undirected graph. So email is an undirected graph. When you then create the directed graph, each edge appears in only one direction.

    To fix it use nx.from_pandas_dataframe with the argument create_using = DiGraph


    older comments related to the output you were getting

    All your strongly connected components have a single node.

    When you do max(nx.strongly_connected_components(G), key=len) it finds the set of nodes which has the longest length and returns it. In your case, they all have length 1, so it returns one of them (I believe whichever networkx happened to put into nx.strongly_connected_components(G) first). But it's returning the set, not the length. So {92} is the set of nodes it is returning.

    It happens that {92} was chosen to be the "longest" length 1 component in nx.strongly_connected_components(G) by the tiebreaker.

    Example:

    max([{1}, {3}, {5}], key = len)
    > {1}