How to remove specific components in a disconnected network

Let’s suppose we have the below data (a sample of my whole dataset that counts thousands of rows):

Node Target 
1      2
1      3
1      5
2      1
2      3
2      6
7      8 
7      12  
9      13
9      15
9      14

Clearly, if I plot in a graph this data I have two components that are disconnected. I am wondering how to isolate or remove one component from my network, e.g., the smallest one. I would say that first I should identify the components, then isolate/filtering out the component(s) that I am not interested in.

G = nx.from_pandas_edgelist(df, 'Node', 'Target')

connected_com=[len(c) for c in sorted(nx.connected_components(G), key=len, reverse=True)]

Now I should create a network only with data from the largest component:

largest_cc = max(nx.connected_components(G), key=len)

This is easy in case of two components. However, if I would like to select two components and exclude one, how should I do? This is my question.

Solution

In the example data you provided, 3 islands are obtained when plotting the graph with the code below:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx 

df=pd.read_fwf('data.txt')
G = nx.from_pandas_edgelist(df, 'Node', 'Target')
nx.draw(G,with_labels=True)

And the graph looks like that:

Now if you want to only keep the biggest two islands you can use the nx.connected_components(G) function that you mentioned and store the two biggest components. Below is the code to do this:

N_subs=2 #Number of biggest islands you want to keep
G_sub=[]
largest_components=[]
for i in range(N_subs):
  largest_components.append(sorted(nx.connected_components(G), key=len, reverse=True)[i])
  G_sub.append(G.subgraph(largest_components[i]))

You will then need to create a subgraph of G that is composed of both islands. You can use nx.compose_all to do that. And you can then just plot your subgraph

G_subgraphs=nx.compose_all(G_sub)
nx.draw(G_subgraphs,with_labels=True)

So overall the code looks like that:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx 
df=pd.read_fwf('data.txt')
G = nx.from_pandas_edgelist(df, 'Node', 'Target')

N_subs=2
G_sub=[]
largest_components=[]
for i in range(N_subs):
  largest_components.append(sorted(nx.connected_components(G), key=len, reverse=True)[i])
  G_sub.append(G.subgraph(largest_components[i]))

G_subgraphs=nx.compose_all(G_sub)
nx.draw(G_subgraphs,with_labels=True)

And the output of this code gives:

Note: According to this, nx.connected_components is best used for undirected graphs. Since you are dealing with directed graphs, you might want to use strongly_connected_components(G) or weakly_connected_components(G) instead.