Search code examples
pandasdataframearray-difference

Identifying differences between groups in pandas dataframe


I have a pandas dataframe indexed by date and and ID. I would like to:

  1. Identify the ID of additions and deletions between dates
  2. Add the ID to another dataframe with the date of the addition/deletion.

 

date        ID   value
12/31/2010  13  -0.124409
             9   0.555959
             1  -0.705634
             2  -3.123603
             4   0.725009
1/31/2011   13   0.471078
             9   0.276006
             1  -0.468463
            22   1.076821
            11   0.668599

Desired output:

date        ID  flag
1/31/2011   22  addition
1/31/2011   11  addition
1/31/2011   2   deletion
1/31/2011   4   deletion

I have tried Diff between two dataframes in pandas . I cannot get this to work on a grouped dataframe. I am unsure how to loop over each group, and compare to the previous group.


Solution

  • I created a helper function that shifts the first level of a pandas.MultiIndex. With this, I can difference it with the original index to determine additions and deletions.

    def shift_level(idx):
        level = idx.levels[0]
        mapping = dict(zip(level[:-1], level[1:]))
        idx = idx.set_levels(level.map(mapping.get), 0)
        return idx[idx.get_level_values(0).notna()].remove_unused_levels()
    
    idx = df.index
    fidx = shift_level(idx)
    
    additions = fidx.difference(idx)
    deletions = idx[idx.labels[0] > 0].difference(fidx)
    
    pd.Series('+', additions).append(
        pd.Series('-', deletions)).rename('flag').reset_index()
    
            date  ID flag
    0 2011-01-31   2    +
    1 2011-01-31   4    +
    2 2011-01-31  11    -
    3 2011-01-31  22    -