Search code examples
pythonloopsdataframeunique

Pandas:Count accumulated unique values based on another column


I have a simple data frame with IDs and date, like below:

    'ID Date
     a  2009/12/1
     c  2009/12/1
     d  2009/12/1
     a  2010/4/1
     c  2010/5/1
     e  2010/5/1
     b  2010/12/1
     b  2012/3/1
     e  2012/7/1
     b  2013/1/1
    ...
    ...'

I need to count unique values by each month and accumulate them but not counting existing IDs. For instance

    `2009/12/1 3
     2010/4/1  3
     2010/5/1  4
     ... ...`

I created a loop but not working

       `for d in df['date'].drop_duplicates():
        c=df[df['date']<=d].ID.nunique()
        df2=DataFrame(data=c,index=d)`

Can anyone tell me where is the problem? thanks


Solution

  • One option is to write a for loop and use a set to hold the cumulative unique IDs:

    cumcount = []
    cumunique = set()
    date = []
    for k, g in df.groupby(pd.to_datetime(df.Date)):
        cumunique |= set(g.ID)                  # hold cumulative unique IDs
        date.append(g.Date.iat[0])              # get the date variable for each group
        cumcount.append(len(cumunique))         # hold cumulative count of unique IDs
    
    pd.DataFrame({"Date": date, "ID": cumcount})
    

    enter image description here