I have a dataframe with two ID columns and a year column; two IDs in the same row mean there is a connection between them. I want to group by year and calculate each ID's total number of connections per year.
I used NetworkX to count the connections across ID1 and ID2, but I can't figure out how to group by year.
import pandas as pd
import networkx as nx

d = {'ID1': [21, 21, 21, 21, 21],
     'ID2': [343252, 44134, 41314, 161345, 89479],
     'year': [2010, 2010, 2010, 2011, 2011]}
df = pd.DataFrame(data=d)

G = nx.from_pandas_edgelist(df, 'ID1', 'ID2')
counts = {}  # node -> number of neighbours
for x in G.nodes:
    counts[x] = len(G[x])
s = pd.Series(counts, name='connections')
df1 = s.to_frame().sort_values('connections', ascending=False)
This gives me the connection counts irrespective of year.
What I want is to build a graph for each year present in the dataset (the data spans 30 years), calculate the connections for that year, and add them back to the dataframe. Is there a modification that achieves this efficiently? I considered looping over the years, filtering the data, and creating a graph for each one, but that seems quite inefficient since I have millions of rows.
Two suggestions for modifying your approach:
def count_connects(sdf):
    G = nx.from_pandas_edgelist(sdf, "ID1", "ID2")
    return pd.DataFrame.from_dict(
        {n: len(G[n]) for n in G.nodes}, orient="index"
    )
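As an aside, `len(G[n])` is simply the node's degree, so the same counts can be taken straight from NetworkX's `G.degree()` view. A minimal sketch (same result for the sample data):

```python
import pandas as pd
import networkx as nx

d = {"ID1": [21, 21, 21, 21, 21],
     "ID2": [343252, 44134, 41314, 161345, 89479]}
G = nx.from_pandas_edgelist(pd.DataFrame(d), "ID1", "ID2")

# G.degree() yields (node, degree) pairs; dict() turns them into a mapping
degrees = pd.Series(dict(G.degree()), name="connections")
```

Here the graph is built over all rows at once, so node 21 has degree 5 (all years combined); inside `count_connects` the same idiom counts per-year degrees.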
# Version 1: one long dataframe, indexed by year
df_connects = (
    df.groupby("year").apply(count_connects)
      .reset_index(level=1)
      .rename(columns={"level_1": "node", 0: "connections"})
)
# Version 2: one column per year
df_connects = pd.concat(
    [
        count_connects(sdf).rename(columns={0: year})
        for year, sdf in df.groupby("year", as_index=False)
    ],
    axis="columns",
)
Results for your sample dataframe:
        node  connections
year
2010      21            3
2010  343252            1
2010   44134            1
2010   41314            1
2011      21            2
2011  161345            1
2011   89479            1
        2010  2011
21       3.0   2.0
41314    1.0   NaN
44134    1.0   NaN
89479    NaN   1.0
161345   NaN   1.0
343252   1.0   NaN
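Since the question mentions adding the counts back to the data, one way (an assumption about what "add it to the database" means) is to merge Version 1's long-format output onto the original rows, matching on year and ID. A self-contained sketch:

```python
import pandas as pd
import networkx as nx

d = {"ID1": [21, 21, 21, 21, 21],
     "ID2": [343252, 44134, 41314, 161345, 89479],
     "year": [2010, 2010, 2010, 2011, 2011]}
df = pd.DataFrame(d)

def count_connects(sdf):
    G = nx.from_pandas_edgelist(sdf, "ID1", "ID2")
    return pd.DataFrame.from_dict(
        {n: len(G[n]) for n in G.nodes}, orient="index"
    )

df_connects = (
    df.groupby("year").apply(count_connects)
      .reset_index(level=1)
      .rename(columns={"level_1": "node", 0: "connections"})
      .reset_index()
)

# Annotate each original row with ID1's connection count for that year;
# the same merge with "node" renamed to "ID2" would annotate ID2 instead.
df_out = df.merge(
    df_connects.rename(columns={"node": "ID1"}),
    on=["year", "ID1"], how="left",
)
```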
To get a feeling for how long it takes, I've tried the following (1,000 possible `ID1`s, 10,000 possible `ID2`s, 2 years, and 2 million rows in total):
from random import randint
from time import perf_counter

num_nodes_1 = 1_000
num_nodes_2 = 10_000
num_years = 2
start_year = 1999
num_rows_per_year = 1_000_000

df = pd.DataFrame(
    [
        [randint(1, num_nodes_1), randint(1, num_nodes_2), start_year + year]
        for year in range(num_years)
        for _ in range(num_rows_per_year)
    ],
    columns=["ID1", "ID2", "year"],
)
print(df)
print(df)
start = perf_counter()
df_connects = (
    df.groupby("year").apply(count_connects)
      .reset_index(level=1)
      .rename(columns={"level_1": "node", 0: "connections"})
)
end = perf_counter()
print(f"Duration version 1: {end - start:.2f} seconds")

start = perf_counter()
df_connects = pd.concat(
    [
        count_connects(sdf).rename(columns={0: year})
        for year, sdf in df.groupby("year", as_index=False)
    ],
    axis="columns",
)
end = perf_counter()
print(f"Duration version 2: {end - start:.2f} seconds")
Didn't take too long:
Duration version 1: 10.58 seconds
Duration version 2: 11.06 seconds
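For comparison: since the degree here is just "how many distinct partners per year", the graphs aren't strictly necessary. A pure-pandas sketch (assuming each (ID1, ID2) pair occurs at most once per year and there are no self-loops; otherwise apply `drop_duplicates` first so the counts match NetworkX degrees):

```python
import pandas as pd

d = {"ID1": [21, 21, 21, 21, 21],
     "ID2": [343252, 44134, 41314, 161345, 89479],
     "year": [2010, 2010, 2010, 2011, 2011]}
df = pd.DataFrame(d)

# Stack both ID columns into one "node" column, keeping the year,
# then count how often each (year, node) pair occurs.
long = df.melt(id_vars="year", value_vars=["ID1", "ID2"], value_name="node")
df_connects = (
    long.groupby(["year", "node"]).size()
        .rename("connections")
        .reset_index()
)
print(df_connects)
```

On large inputs this avoids building a graph per year entirely, at the cost of the deduplication caveat above.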