Tags: python, pandas, pandas-groupby, grouping, networkx

Group by year in NetworkX to calculate annual number of connections


I have a dataframe with two ID columns and a year column. Two IDs in the same row mean there is a connection between them. I want to group by year and calculate the total number of connections each ID has in each year.

I used NetworkX to count the connections, considering only ID1 and ID2, but I can't figure out how to group by year.

import pandas as pd
import networkx as nx
d = {'ID1': [21, 21, 21, 21, 21], 'ID2': [343252, 44134, 41314, 161345, 89479],'year': [2010, 2010, 2010, 2011, 2011]}
df = pd.DataFrame(data=d)
G = nx.from_pandas_edgelist(df, 'ID1', 'ID2')
connections = {}
for x in G.nodes:
    connections[x] = len(G[x])  # number of neighbours of node x
s = pd.Series(connections, name='connections')
df1 = s.to_frame().sort_values('connections', ascending=False)

This gives me the connection counts irrespective of year.

What I want to do is create a graph for each year present in the dataset (the data spans 30 years), calculate the connections for that year, and add them to the dataframe. Is there any modification I can make to do this, considering I have quite a large dataset of variables? I considered looping over the years to filter the data and create a graph for each one, but that seems quite inefficient since I have millions of rows.
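
For reference, the year-by-year loop I was considering looks roughly like this (filtering the dataframe once per year and building a graph each time):

# rough sketch of the per-year loop I had in mind
yearly_counts = {}
for year in df['year'].unique():
    sdf = df[df['year'] == year]  # re-scans the full dataframe for every year
    G = nx.from_pandas_edgelist(sdf, 'ID1', 'ID2')
    yearly_counts[year] = {n: len(G[n]) for n in G.nodes}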


Solution

  • Two suggestions to modify your approach, both built around a small helper that counts each node's connections within one year's sub-dataframe:

    def count_connects(sdf):
        # build the graph for one year's edges and count each node's neighbours
        G = nx.from_pandas_edgelist(sdf, "ID1", "ID2")
        return pd.DataFrame.from_dict(
            {n: len(G[n]) for n in G.nodes}, orient="index"
        )
    
    # Version 1: long format, one row per (year, node) pair
    df_connects = (
        df.groupby("year").apply(count_connects)
          .reset_index(level=1)
          .rename(columns={"level_1": "node", 0: "connections"})
    )
    
    # Version 2: wide format, one column per year
    df_connects = pd.concat(
        [
            count_connects(sdf).rename(columns={0: year})
            for year, sdf in df.groupby("year", as_index=False)
        ],
        axis="columns"
    )
    

    Results for your sample dataframe:

            node  connections
    year                     
    2010      21            3
    2010  343252            1
    2010   44134            1
    2010   41314            1
    2011      21            2
    2011  161345            1
    2011   89479            1
    
            2010  2011
    21       3.0   2.0
    41314    1.0   NaN
    44134    1.0   NaN
    89479    NaN   1.0
    161345   NaN   1.0
    343252   1.0   NaN
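
    A small aside: len(G[n]) is just the degree of node n, so count_connects could equally be written with NetworkX's degree view; the two only differ if your data contains self-loops (an ID paired with itself):

    def count_connects(sdf):
        # same result as above (assuming no self-loops), via the degree view
        G = nx.from_pandas_edgelist(sdf, "ID1", "ID2")
        return pd.DataFrame.from_dict(dict(G.degree()), orient="index")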
    

    To get a feeling for how long this takes, I tried the following (1,000 possible ID1s, 10,000 possible ID2s, 2 years, and 2 million rows in total):

    from random import randint
    from time import perf_counter
    
    num_nodes_1 = 1_000
    num_nodes_2 = 10_000
    num_years = 2
    start_year = 1999
    num_rows_per_year = 1_000_000
    
    df = pd.DataFrame(
        [
            [randint(1, num_nodes_1), randint(1, num_nodes_2), start_year + year]
            for year in range(num_years)
            for _ in range(num_rows_per_year)
        ],
        columns=["ID1", "ID2", "year"]
    )
    print(df)
    
    start = perf_counter()
    df_connects = (
        df.groupby("year").apply(count_connects)
          .reset_index(level=1)
          .rename(columns={"level_1": "node", 0: "connections"})
    )
    end = perf_counter()
    print(f"Duration version 1: {end - start:.2f} seconds")
    
    start = perf_counter()
    df_connects = pd.concat(
        [
            count_connects(sdf).rename(columns={0: year})
            for year, sdf in df.groupby("year", as_index=False)
        ],
        axis="columns"
    )
    end = perf_counter()
    print(f"Duration version 2: {end - start:.2f} seconds")
    

    Didn't take too long:

    Duration version 1: 10.58 seconds
    Duration version 2: 11.06 seconds
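
    If that is still too slow for the full 30 years, a possible shortcut is to skip the graphs entirely. This rests on a few assumptions about your data that I can't verify from the sample: each pair appears at most once per year, never with the IDs swapped, and there are no self-loops. Under those assumptions a node's connection count is simply how often its ID appears in either column for that year:

    # pure-pandas variant, valid only under the assumptions above
    edges = df.drop_duplicates(["year", "ID1", "ID2"])
    long = pd.concat(
        [
            edges[["year", "ID1"]].rename(columns={"ID1": "node"}),
            edges[["year", "ID2"]].rename(columns={"ID2": "node"}),
        ]
    )
    df_connects = (
        long.groupby(["year", "node"]).size()
            .rename("connections")
            .reset_index()
    )

    If the same pair can show up with the IDs swapped, stick with the graph-based versions above, since nx.Graph merges those edges and this variant would not.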