Search code examples
pandasdataframemerge

How to make a vertical merge and count the differences over time?


I have this dataframe :

   CATEGORY                      DATE    ID
0         A   Sunday, January 1, 2023  Id_1
1         A   Sunday, January 1, 2023  Id_2
2         A   Monday, January 2, 2023  Id_1
3         A   Monday, January 2, 2023  Id_3
4         A   Monday, January 2, 2023  Id_2
5         A  Tuesday, January 3, 2023  Id_4
6         A  Tuesday, January 3, 2023  Id_5
7         B   Sunday, January 1, 2023  Id_5
8         B   Monday, January 2, 2023  Id_2
9         B  Tuesday, January 3, 2023  Id_6
10        B  Tuesday, January 3, 2023  Id_5

DF USED :

import pandas as pd

df = pd.DataFrame({'CATEGORY': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'], 'DATE': ['Sunday, January 1, 2023', 'Sunday, January 1, 2023', 'Monday, January 2, 2023', 'Monday, January 2, 2023', 'Monday, January 2, 2023', 'Tuesday, January 3, 2023', 'Tuesday, January 3, 2023', 'Sunday, January 1, 2023', 'Monday, January 2, 2023', 'Tuesday, January 3, 2023', 'Tuesday, January 3, 2023'], 'ID': ['Id_1', 'Id_2', 'Id_1', 'Id_3', 'Id_2', 'Id_4', 'Id_5', 'Id_5', 'Id_2', 'Id_6', 'Id_5']})

I need for each category to track the number of ids IN (who apperead between two consecutives dates) and the ids OUT (who disappeared between two consecutive dates). I feel like we need to use a merge but I can't find a way to use merge since the data are vertical and not in separate dataframes.

My expected output is this (I draw it manual).

The first date of each category should be excluded because it has no previsou date

enter image description here

Explanation :

in category A:
    Sunday, January 1, 2023 Vs Monday, January 2, 2023
        One id appeared in Monday, January 2, 2023 == Id_3
        Zero ids disappeared in Monday, January 2, 2023
    Monday, January 2, 2023 Vs Tuesday, January 3, 2023
        Two ids appeared in Tuesday, January 3, 2023 == [Id_4, Id_5]
        Three ids disappeared in Tuesday, January 3, 2023 == [Id_1, Id_2, Id_3]

in category B:
     Sunday, January 1, 2023 Vs Monday, January 2, 2023
          One id appeared in Monday, January 2, 2023 == Id_2
          One id disappeared in Monday, January 2, 2023 == Id_5
     Monday, January 2, 2023 Vs Tuesday, January 3, 2023
          Two ids appeared in Tuesday, January 3, 2023 == [Id_6, Id_5]
          One id disappeared in Tuesday, January 3, 2023 == Id_2

I made the code below and I feel like I'm close because I was able to get the right part of the output :

from collections import defaultdict

data = defaultdict(list)
for category, g1 in df.groupby('CATEGORY'):
    list_in = []
    list_out = []
    for id, g2 in g1.groupby('ID'):
        if len(g2) == 1:
            list_in.append(g2['ID'].iloc[0])
        else:
            list_out.append(g2['ID'].iloc[0])
    data[category].append({'IN': list_in, 'OUT': list_out})


{'A': [{'IN': ['Id_3', 'Id_4', 'Id_5'], 'OUT': ['Id_1', 'Id_2']}],
 'B': [{'IN': ['Id_2', 'Id_6'], 'OUT': ['Id_5']}]}

result = pd.DataFrame(data).T[0].apply(pd.Series).applymap(len)
print(result)

   IN  OUT
A   3    2
B   2    1

Can you guys help me complete this code ?


Solution

  • I would use set operations, aggregate as set, compute the difference of size and reshape:

    tmp = (df.assign(DATE=lambda d: pd.to_datetime(d['DATE']))
             .sort_values(by='DATE')
             .groupby(['CATEGORY', 'DATE'])['ID'].agg(set)
          )
    
    IN = tmp.groupby('CATEGORY').diff().str.len().dropna()
    OUT = (tmp.groupby('CATEGORY')
           .transform(lambda s: s.diff(-1).str.len().shift())
           .dropna()
          )
    
    out = (pd
       .concat({'IN': IN, 'OUT': OUT}, axis=1).unstack('DATE')
       .swaplevel(axis=1).sort_index(axis=1)
       .convert_dtypes()
    )
    

    Output:

    DATE         2023-01-02     2023-01-03    
                     IN OUT         IN OUT
    CATEGORY                              
    A                 1   0          2   3
    B                 1   1          2   1