Below is a toy panel dataset with panel ID ('id'), time ('time'), value ('value') and some values that will be used as conditions ('cond').
import numpy as np
import pandas as pd

df = pd.DataFrame({'id'   : [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4],
                   'time' : [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5],
                   'value': [0,1,0,1,1,0,0,1,0,1,1,0,0,1,1,1,0,1,1,1]})
cond = np.array([['A','B'],['A','C'],['C','D'],['D','E']])
df['cond'] = pd.Series(list(np.repeat(cond, repeats=[5,5,5,5], axis=0)))
print(df)
    id  time  value    cond
0    1     1      0  [A, B]
1    1     2      1  [A, B]
2    1     3      0  [A, B]
3    1     4      1  [A, B]
4    1     5      1  [A, B]
5    2     1      0  [A, C]
6    2     2      0  [A, C]
7    2     3      1  [A, C]
8    2     4      0  [A, C]
9    2     5      1  [A, C]
10   3     1      1  [C, D]
11   3     2      0  [C, D]
12   3     3      0  [C, D]
13   3     4      1  [C, D]
14   3     5      1  [C, D]
15   4     1      1  [D, E]
16   4     2      0  [D, E]
17   4     3      1  [D, E]
18   4     4      1  [D, E]
19   4     5      1  [D, E]
Basically, what I want is to add a new column with the sum of the values (in the 'value' column) by time (in the 'time' column), i.e., groupby('time')['value'].transform('sum'). The complication is that, for each ID, the sum should only run over that ID itself plus the other IDs whose 'cond' lists share at least one element with its own. For example, for id==1 the only other relevant ID is id==2 (because 'A' is the common element); for id==2 it is id==1 ('A' is common) and id==3 ('C' is common).
So, my desired output is shown in the 'cond_sum_by_time' column:
    id  time  value    cond  cond_sum_by_time
0    1     1      0  [A, B]                 0
1    1     2      1  [A, B]                 1
2    1     3      0  [A, B]                 1
3    1     4      1  [A, B]                 1
4    1     5      1  [A, B]                 2
5    2     1      0  [A, C]                 1
6    2     2      0  [A, C]                 1
7    2     3      1  [A, C]                 1
8    2     4      0  [A, C]                 2
9    2     5      1  [A, C]                 3
10   3     1      1  [C, D]                 2
11   3     2      0  [C, D]                 0
12   3     3      0  [C, D]                 2
13   3     4      1  [C, D]                 2
14   3     5      1  [C, D]                 3
15   4     1      1  [D, E]                 2
16   4     2      0  [D, E]                 0
17   4     3      1  [D, E]                 1
18   4     4      1  [D, E]                 2
19   4     5      1  [D, E]                 2
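To spell out one cell of that table: ids 1 and 2 share 'A', so for id==1 at time==5 the new column is the sum of their two values. A quick check, using the df built above:

mask = df['id'].isin([1, 2]) & (df['time'] == 5)
print(df.loc[mask, 'value'].sum())  # 2, matching row 4 above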
I think I can achieve my goal with, for example, a for loop, but I would like to know if there is a better/more efficient way to do this. Thanks in advance.
UPDATE:
Below is my current code, which uses a for loop:
# to store the desired new dataframe
new_df = pd.DataFrame()
# convert the condition lists to sets for comparison
df['cond'] = df['cond'].apply(set)
# one row per id with its condition set
id_cond_df = df.groupby('id').last().reset_index()[['id', 'cond']]
# for each id and its condition...
for i, row in id_cond_df.iterrows():
    focal_id = row['id']
    focal_cond = row['cond']
    # rows of df whose 'cond' shares at least one element with the focal condition
    idx = df['cond'].apply(lambda x: not x.isdisjoint(focal_cond))
    common_df = df[idx].reset_index(drop=True)
    # sum by time over the matching ids
    common_df['cond_sum_by_time'] = common_df.groupby('time')['value'].transform('sum')
    # keep only the rows for the focal id
    common_df = common_df.loc[common_df['id'] == focal_id]
    # append to the result
    new_df = pd.concat([new_df, common_df], axis=0).reset_index(drop=True)
new_df['cond'] = new_df['cond'].apply(list)
My actual dataset is large (about 20,000 IDs, each with 240 time periods), and the above code takes a long time to run. I would appreciate any suggestions.
Your problem suffers from combinatorial explosion, as the conditions associated with each id have to be compared against those of all other ids. It's hard to do this in less than quadratic time (would love to know if someone has a better algorithm!), but for the size of your data even that should be sufficient as long as the critical operations are vectorised.
To get NumPy to do this fast, I chose to convert your lists of conditions into integers, where each bit records whether the corresponding condition is present. First we use explode, then crosstab:
>>> df_expl = df.groupby('id').last().explode('cond') # Un-nest lists
>>> df_expl
    time  value cond
id
1      5      1    A
1      5      1    B
2      5      1    A
2      5      1    C
3      5      1    C
3      5      1    D
4      5      1    D
4      5      1    E
>>> cross = pd.crosstab(df_expl.index, df_expl.cond)  # Convert to flags
>>> cross
cond   A  B  C  D  E
row_0
1      1  1  0  0  0
2      1  0  1  0  0
3      0  0  1  1  0
4      0  0  0  1  1
>>> conds = (2**np.arange(cross.shape[1])).dot(cross.values.T)  # Combine flags as bits
>>> conds
array([ 3,  5, 12, 24], dtype=int64)
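The bit encoding makes the overlap test cheap: id 1 has the bits for 'A' and 'B' set (1 + 2 = 3), and two ids share a condition exactly when the bitwise AND of their masks is non-zero:

>>> (conds[0] & conds[1]) != 0  # ids 1 and 2 share 'A'
True
>>> (conds[0] & conds[2]) != 0  # ids 1 and 3 share nothing
False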
Now we can summarise the relationships between the conditions of different ids with a simple binary matrix:
>>> common = (conds & conds[:, None]) != 0
>>> common
array([[ True,  True, False, False],
       [ True,  True,  True, False],
       [False,  True,  True,  True],
       [False, False,  True,  True]])
What remains is just to loop over all time steps (luckily the smaller dimension, in your case) and use matrix multiplication to sum the appropriate values together.
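For example, at time 1 the value vector over ids 1..4 is [0, 0, 1, 1], and a single matrix product already gives the desired column for that time step:

>>> np.array([0, 0, 1, 1]).dot(common)  # values at time 1
array([0, 1, 2, 2])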
As a complete function:
def conditional_sum(df):
    # one row per id, with its condition list un-nested
    df_expl = df.groupby('id').last().explode('cond')
    cross = pd.crosstab(df_expl.index, df_expl.cond)
    conds = (2**np.arange(cross.shape[1])).dot(cross.values.T)
    common = (conds & conds[:, None]) != 0
    df['cond_sum_by_time'] = 0
    for time in df['time'].unique():
        time_vals = df[df['time'] == time]['value'].values
        df.loc[df['time'] == time, 'cond_sum_by_time'] = time_vals.dot(common)
    return df
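As a quick sanity check on the toy dataframe above (with 'cond' still held as lists), this reproduces the desired column, e.g. for id 2:

>>> conditional_sum(df).loc[df['id'] == 2, 'cond_sum_by_time'].tolist()
[1, 1, 1, 2, 3]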
Some timings for intermediate sizes of dataframes (orig is your implementation):
In []: %timeit orig(df_small) # 500 ids, 50 time steps
5.8 s ± 73.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In []: %timeit conditional_sum(df_small)
103 ms ± 2.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In []: %timeit conditional_sum(df_medium) # 2000 ids, 100 time steps
2.69 s ± 32.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
When I generate a dataframe with 20,000 ids and 240 time steps, the function completes in around 16.5 minutes on my machine (for comparison, your original implementation was estimated to take around 10 hours 20 minutes):
In []: %timeit conditional_sum(df_large) # 20000 ids, 240 time steps
16min 33s ± 20.1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
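If you want to avoid the Python-level loop over time steps entirely, one possible variant (a sketch only, not benchmarked; it assumes every id has exactly one row per time and that df is sorted by id and then time, as in your example) is to pivot the values into an (n_times, n_ids) matrix and do a single matrix product:

def conditional_sum_pivot(df):
    df_expl = df.groupby('id').last().explode('cond')
    cross = pd.crosstab(df_expl.index, df_expl.cond)
    conds = (2**np.arange(cross.shape[1])).dot(cross.values.T)
    common = (conds & conds[:, None]) != 0
    # values as an (n_times, n_ids) matrix, one matmul for all time steps
    vals = df.pivot(index='time', columns='id', values='value')
    sums = vals.values.dot(common)  # (n_times, n_ids)
    # transpose back to the (id, time) row order of df
    df['cond_sum_by_time'] = sums.T.ravel()
    return df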