I have this dataframe :
CATEGORY DATE ID
0 A Sunday, January 1, 2023 Id_1
1 A Sunday, January 1, 2023 Id_2
2 A Monday, January 2, 2023 Id_1
3 A Monday, January 2, 2023 Id_3
4 A Monday, January 2, 2023 Id_2
5 A Tuesday, January 3, 2023 Id_4
6 A Tuesday, January 3, 2023 Id_5
7 B Sunday, January 1, 2023 Id_5
8 B Monday, January 2, 2023 Id_2
9 B Tuesday, January 3, 2023 Id_6
10 B Tuesday, January 3, 2023 Id_5
DF USED :
import pandas as pd
df = pd.DataFrame({'CATEGORY': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'], 'DATE': ['Sunday, January 1, 2023', 'Sunday, January 1, 2023', 'Monday, January 2, 2023', 'Monday, January 2, 2023', 'Monday, January 2, 2023', 'Tuesday, January 3, 2023', 'Tuesday, January 3, 2023', 'Sunday, January 1, 2023', 'Monday, January 2, 2023', 'Tuesday, January 3, 2023', 'Tuesday, January 3, 2023'], 'ID': ['Id_1', 'Id_2', 'Id_1', 'Id_3', 'Id_2', 'Id_4', 'Id_5', 'Id_5', 'Id_2', 'Id_6', 'Id_5']})
I need for each category to track the number of ids IN
(who apperead between two consecutives dates) and the ids OUT
(who disappeared between two consecutive dates). I feel like we need to use a merge but I can't find a way to use merge
since the data are vertical and not in separate dataframes.
My expected output is this (I draw it manual).
The first date of each category should be excluded because it has no previsou date
Explanation :
in category A:
Sunday, January 1, 2023 Vs Monday, January 2, 2023
One id appeared in Monday, January 2, 2023 == Id_3
Zero ids disappeared in Monday, January 2, 2023
Monday, January 2, 2023 Vs Tuesday, January 3, 2023
Two ids appeared in Tuesday, January 3, 2023 == [Id_4, Id_5]
Three ids disappeared in Tuesday, January 3, 2023 == [Id_1, Id_2, Id_3]
in category B:
Sunday, January 1, 2023 Vs Monday, January 2, 2023
One id appeared in Monday, January 2, 2023 == Id_2
One id disappeared in Monday, January 2, 2023 == Id_5
Monday, January 2, 2023 Vs Tuesday, January 3, 2023
Two ids appeared in Tuesday, January 3, 2023 == [Id_6, Id_5]
One id disappeared in Tuesday, January 3, 2023 == Id_2
I made the code below and I feel like I'm close because I was able to get the right part of the output :
from collections import defaultdict
data = defaultdict(list)
for category, g1 in df.groupby('CATEGORY'):
list_in = []
list_out = []
for id, g2 in g1.groupby('ID'):
if len(g2) == 1:
list_in.append(g2['ID'].iloc[0])
else:
list_out.append(g2['ID'].iloc[0])
data[category].append({'IN': list_in, 'OUT': list_out})
{'A': [{'IN': ['Id_3', 'Id_4', 'Id_5'], 'OUT': ['Id_1', 'Id_2']}],
'B': [{'IN': ['Id_2', 'Id_6'], 'OUT': ['Id_5']}]}
result = pd.DataFrame(data).T[0].apply(pd.Series).applymap(len)
print(result)
IN OUT
A 3 2
B 2 1
Can you guys help me complete this code ?
I would use set
operations, aggregate as set
, compute the difference of size and reshape:
tmp = (df.assign(DATE=lambda d: pd.to_datetime(d['DATE']))
.sort_values(by='DATE')
.groupby(['CATEGORY', 'DATE'])['ID'].agg(set)
)
IN = tmp.groupby('CATEGORY').diff().str.len().dropna()
OUT = (tmp.groupby('CATEGORY')
.transform(lambda s: s.diff(-1).str.len().shift())
.dropna()
)
out = (pd
.concat({'IN': IN, 'OUT': OUT}, axis=1).unstack('DATE')
.swaplevel(axis=1).sort_index(axis=1)
.convert_dtypes()
)
Output:
DATE 2023-01-02 2023-01-03
IN OUT IN OUT
CATEGORY
A 1 0 2 3
B 1 1 2 1