I have a dataframe with two ID columns and a year column; two IDs in the same row mean there is a connection between them. I want to group by year and calculate each ID's total number of connections per year.
I used NetworkX to count the connections across ID1 and ID2, but I can't figure out how to group by year.
import pandas as pd
import networkx as nx

d = {'ID1': [21, 21, 21, 21, 21],
     'ID2': [343252, 44134, 41314, 161345, 89479],
     'year': [2010, 2010, 2010, 2011, 2011]}
df = pd.DataFrame(data=d)

G = nx.from_pandas_edgelist(df, 'ID1', 'ID2')
counts = {}  # node -> number of neighbours
for x in G.nodes:
    counts[x] = len(G[x])
s = pd.Series(counts, name='connections')
df1 = s.to_frame().sort_values('connections', ascending=False)
This gives me the connection counts irrespective of year.
What I want is to build a graph for each year present in the dataset (the data spans 30 years), calculate the connections for that year, and add them back to the dataframe. Is there a modification that achieves this efficiently? I considered looping over the years, filtering the data, and creating a graph for each one, but that seems quite inefficient since I have millions of rows.
Two suggestions for modifying your approach:
def count_connects(sdf):
    G = nx.from_pandas_edgelist(sdf, "ID1", "ID2")
    return pd.DataFrame.from_dict(
        {n: len(G[n]) for n in G.nodes}, orient="index"
    )
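As an aside, `len(G[n])` is simply the node's degree, so the same counts can be taken straight from NetworkX's `G.degree()` view. A minimal sketch (same result for the sample data):

```python
import pandas as pd
import networkx as nx

d = {"ID1": [21, 21, 21, 21, 21],
     "ID2": [343252, 44134, 41314, 161345, 89479]}
G = nx.from_pandas_edgelist(pd.DataFrame(d), "ID1", "ID2")

# G.degree() yields (node, degree) pairs; dict() turns them into a mapping
degrees = pd.Series(dict(G.degree()), name="connections")
```

Here the graph is built over all rows at once, so node 21 has degree 5 (all years combined); inside `count_connects` the same idiom counts per-year degrees.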
# Version 1: one long dataframe, indexed by year
df_connects = (
    df.groupby("year").apply(count_connects)
      .reset_index(level=1)
      .rename(columns={"level_1": "node", 0: "connections"})
)
# Version 2: one column per year
df_connects = pd.concat(
    [
        count_connects(sdf).rename(columns={0: year})
        for year, sdf in df.groupby("year", as_index=False)
    ],
    axis="columns",
)
Results for your sample dataframe:
        node  connections
year
2010      21            3
2010  343252            1
2010   44134            1
2010   41314            1
2011      21            2
2011  161345            1
2011   89479            1
        2010  2011
21       3.0   2.0
41314    1.0   NaN
44134    1.0   NaN
89479    NaN   1.0
161345   NaN   1.0
343252   1.0   NaN
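Since the question mentions adding the counts back to the data, one way (an assumption about what "add it to the database" means) is to merge Version 1's long-format output onto the original rows, matching on year and ID. A self-contained sketch:

```python
import pandas as pd
import networkx as nx

d = {"ID1": [21, 21, 21, 21, 21],
     "ID2": [343252, 44134, 41314, 161345, 89479],
     "year": [2010, 2010, 2010, 2011, 2011]}
df = pd.DataFrame(d)

def count_connects(sdf):
    G = nx.from_pandas_edgelist(sdf, "ID1", "ID2")
    return pd.DataFrame.from_dict(
        {n: len(G[n]) for n in G.nodes}, orient="index"
    )

df_connects = (
    df.groupby("year").apply(count_connects)
      .reset_index(level=1)
      .rename(columns={"level_1": "node", 0: "connections"})
      .reset_index()
)

# Annotate each original row with ID1's connection count for that year;
# the same merge with "node" renamed to "ID2" would annotate ID2 instead.
df_out = df.merge(
    df_connects.rename(columns={"node": "ID1"}),
    on=["year", "ID1"], how="left",
)
```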
To get a feeling for how long it takes, I've tried the following (1,000 possible `ID1`s, 10,000 possible `ID2`s, 2 years, and 2 million rows in total):
from random import randint
from time import perf_counter

num_nodes_1 = 1_000
num_nodes_2 = 10_000
num_years = 2
start_year = 1999
num_rows_per_year = 1_000_000

df = pd.DataFrame(
    [
        [randint(1, num_nodes_1), randint(1, num_nodes_2), start_year + year]
        for year in range(num_years)
        for _ in range(num_rows_per_year)
    ],
    columns=["ID1", "ID2", "year"],
)
print(df)
print(df)
start = perf_counter()
df_connects = (
    df.groupby("year").apply(count_connects)
      .reset_index(level=1)
      .rename(columns={"level_1": "node", 0: "connections"})
)
end = perf_counter()
print(f"Duration version 1: {end - start:.2f} seconds")

start = perf_counter()
df_connects = pd.concat(
    [
        count_connects(sdf).rename(columns={0: year})
        for year, sdf in df.groupby("year", as_index=False)
    ],
    axis="columns",
)
end = perf_counter()
print(f"Duration version 2: {end - start:.2f} seconds")
Didn't take too long:
Duration version 1: 10.58 seconds
Duration version 2: 11.06 seconds
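For comparison: since the degree here is just "how many distinct partners per year", the graphs aren't strictly necessary. A pure-pandas sketch (assuming each (ID1, ID2) pair occurs at most once per year and there are no self-loops; otherwise apply `drop_duplicates` first so the counts match NetworkX degrees):

```python
import pandas as pd

d = {"ID1": [21, 21, 21, 21, 21],
     "ID2": [343252, 44134, 41314, 161345, 89479],
     "year": [2010, 2010, 2010, 2011, 2011]}
df = pd.DataFrame(d)

# Stack both ID columns into one "node" column, keeping the year,
# then count how often each (year, node) pair occurs.
long = df.melt(id_vars="year", value_vars=["ID1", "ID2"], value_name="node")
df_connects = (
    long.groupby(["year", "node"]).size()
        .rename("connections")
        .reset_index()
)
print(df_connects)
```

On large inputs this avoids building a graph per year entirely, at the cost of the deduplication caveat above.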