Let’s suppose we have the below data (a sample of my whole dataset that counts thousands of rows):
Node Target
1 2
1 3
1 5
2 1
2 3
2 6
7 8
7 12
9 13
9 15
9 14
Clearly, if I plot in a graph this data I have two components that are disconnected. I am wondering how to isolate or remove one component from my network, e.g., the smallest one. I would say that first I should identify the components, then isolate/filtering out the component(s) that I am not interested in.
G = nx.from_pandas_edgelist(df, 'Node', 'Target')
connected_com=[len(c) for c in sorted(nx.connected_components(G), key=len, reverse=True)]
Now I should create a network only with data from the largest component:
largest_cc = max(nx.connected_components(G), key=len)
This is easy in case of two components. However, if I would like to select two components and exclude one, how should I do? This is my question.
In the example data you provided, 3 islands are obtained when plotting the graph with the code below:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx
df=pd.read_fwf('data.txt')
G = nx.from_pandas_edgelist(df, 'Node', 'Target')
nx.draw(G,with_labels=True)
And the graph looks like that:
Now if you want to only keep the biggest two islands you can use the nx.connected_components(G)
function that you mentioned and store the two biggest components. Below is the code to do this:
N_subs=2 #Number of biggest islands you want to keep
G_sub=[]
largest_components=[]
for i in range(N_subs):
largest_components.append(sorted(nx.connected_components(G), key=len, reverse=True)[i])
G_sub.append(G.subgraph(largest_components[i]))
You will then need to create a subgraph of G that is composed of both islands. You can use nx.compose_all
to do that. And you can then just plot your subgraph
G_subgraphs=nx.compose_all(G_sub)
nx.draw(G_subgraphs,with_labels=True)
So overall the code looks like that:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx
df=pd.read_fwf('data.txt')
G = nx.from_pandas_edgelist(df, 'Node', 'Target')
N_subs=2
G_sub=[]
largest_components=[]
for i in range(N_subs):
largest_components.append(sorted(nx.connected_components(G), key=len, reverse=True)[i])
G_sub.append(G.subgraph(largest_components[i]))
G_subgraphs=nx.compose_all(G_sub)
nx.draw(G_subgraphs,with_labels=True)
And the output of this code gives:
Note: According to this, nx.connected_components
is best used for undirected graphs. Since you are dealing with directed graphs, you might want to use strongly_connected_components(G)
or weakly_connected_components(G)
instead.