I am working on an Networkx
.MultiDiGraph()
object built from a total of 82927 directed email data. At current stage, I am trying to get the largest strongly connected components from the .MultiDiGraph()
object and its corresponding subgraph.
The text data can be accessed here.
Here's my working code:
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
email_df = pd.read_csv('email_network.txt', delimiter = '->')
edge_groups = email_df.groupby(["#Sender", "Recipient"], as_index=False).count().rename(columns={"time":"weight"})
email = nx.from_pandas_dataframe(edge_groups, '#Sender', 'Recipient', edge_attr = 'weight')
G = nx.MultiDiGraph()
G.add_edges_from(email.edges(data=True))
# G is a .MultiDiGraph object
# using .strongly_connected_components() to get the part of G that has the most nodes
# using list comprehension
number_of_nodes = [len(n) for n in sorted(nx.strongly_connected_components(G))]
number_of_nodes
# 'number_of_nodes' return a list of [1, 1, 1,...,1] of length 167 (which is the exact number of nodes in the network)
# using the recommended method in networkx documentation
largest = max(nx.strongly_connected_components(G), key=len)
largest
# 'largest' returns {92}, not sure what this means...
As I noted in the above code block, the list comprehension method returns a list of [1, 1, 1,..., 1] of length 167 (which is the total number of nodes in my data), while the max(nx.strongly_connected_components(G), key=len)
returned {92}
, I am not sure what this means.
It looks like there's something wrong with my code and I might have missed several key steps in processing the data. Could anyone care to take a look at and enlighten me on this?
Thank you.
Note: Revised code (kudos to Eric and Joel)
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
email_df = pd.read_csv('email_network.txt', delimiter = ' ')
edge_groups = email_df.groupby(["#Sender", "Recipient"], as_index=False).count().rename(columns={"time":"weight"})
# per @Joel's comment, adding 'create_using = nx.DiGraph()'
email = nx.from_pandas_dataframe(edge_groups, '#Sender', 'Recipient', edge_attr = 'weight', create_using = nx.DiGraph())
# adding this 'directed' edge list to .MultiDiGraph() object
G = nx.MultiDiGraph()
G.add_edges_from(email.edges(data=True))
We now examine the largest strongly connected component (in terms of the number of nodes) in this network.
In [1]: largest = max(nx.strongly_connected_components(G), key=len)
In [2]: len(largest)
Out [2]: 126
The largest strongly connected component consists of 126 nodes.
[Updates]
Upon further trial and error, I found that one needs to use create_using = .MultiDiGraph()
(instead of .DiGraph()
) when loading data onto networkx
, otherwise, even if you get correct number of nodes for your MultiDiGraph
and its weakly/strongly connected subgraphs, you might still get the number of edges wrong! This will reflect in you .strongly_connected_subgraphs()
outputs.
For my case here, I will recommend others to use this one-liner
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
G = nx.read_edgelist(path="email_network.txt", data=[('time', int)], create_using=nx.MultiDiGraph(), nodetype=str)
And we can implement .strongly_connected_components(G)
and strongly_connected_subgraphs
to verify.
If you use the networkx
output G
from the first code block, max(nx.strongly_connected_components(G), key=len)
will give an output with 126 nodes and 52xx something edges, but if you apply the one-liner I listed above, you will get:
In [1]: largest = max(nx.strongly_connected_components(G), key=len)
In [2]: G_sc = max(nx.strongly_connected_subgraphs(G), key=len)
In [3]: nx.number_of_nodes(G_sc)
Out [3]: 126
In [4]: nx.number_of_nodes(G_sc)
Out [4]: 82130
You will get the same number of nodes with both methods but different number of edges owing to different counting mechanisms associated with different networkx
graph classes.
The underlying cause of your error is that nx.from_pandas_dataframe
defaults to creating an undirected graph. So email
is an undirected graph. When you then create the directed graph, each edge appears in only one direction.
To fix it use nx.from_pandas_dataframe
with the argument create_using = DiGraph
older comments related to the output you were getting
All your strongly connected components have a single node.
When you do max(nx.strongly_connected_components(G), key=len)
it finds the set of nodes which has the longest length and returns it. In your case, they all have length 1, so it returns one of them (I believe whichever networkx happened to put into nx.strongly_connected_components(G)
first). But it's returning the set, not the length. So {92}
is the set of nodes it is returning.
It happens that {92}
was chosen to be the "longest" length 1 component in nx.strongly_connected_components(G)
by the tiebreaker.
Example:
max([{1}, {3}, {5}], key = len)
> {1}