Search code examples
pythonpandaslambdapandas-groupbysplit-apply-combine

Using groupby with expanding and a custom function


I have a dataframe that consists of truthIds and trackIds:

truthId = ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'C', 'B', 'A', 'A', 'C', 'C']
trackId = [1, 1, 2, 2, 3, 4, 5, 3, 2, 1, 5, 4, 6]
df1 = pd.DataFrame({'truthId': truthId, 'trackId': trackId})
    trackId truthId
0         1       A
1         1       A
2         2       B
3         2       B
4         3       C
5         4       C
6         5       A
7         3       C
8         2       B
9         1       A
10        5       A
11        4       C
12        6       C

I wish to add a column that calculates, for each unique truthId, the length of the set of unique tracksIds that have previously (i.e. from the top of the data to that row) been associated with it:

       truthId  trackId  unique_Ids
0        A        1           1
1        A        1           1
2        B        2           1
3        B        2           1
4        C        3           1
5        C        4           2
6        A        5           2
7        C        3           2
8        B        2           1
9        A        1           2
10       A        5           2
11       C        4           2
12       C        6           3

I am very close to accomplishing this. I can use:

df.groupby('truthId').expanding().agg({'trackId': lambda x: len(set(x))})

Which produces the following output:

                trackId
truthId            
A       0       1.0
        1       1.0
        6       2.0
        9       2.0
        10      2.0
B       2       1.0
        3       1.0
        8       1.0
C       4       1.0
        5       2.0
        7       2.0
        11      2.0
        12      3.0

This is consistent with the documentation

However, it throws an error when I attempt to assign this output to a new column:

df['unique_Ids'] = df.groupby('truthId').expanding().agg({'trackId': lambda x: len(set(x))})

I have used this workflow before and ideally the new column is put back into the original DateFrame with no issues (i.e. Split-Apply-Combine). How can I get it to work?


Solution

  • You need reset_index

    df['Your']=(df.groupby('truthId').expanding().agg({'trackId': lambda x: len(set(x))})).reset_index(level=0,drop=True)
    df
    Out[1162]: 
        trackId truthId  Your
    0         1       A   1.0
    1         1       A   1.0
    2         2       B   1.0
    3         2       B   1.0
    4         3       C   1.0
    5         4       C   2.0
    6         5       A   2.0
    7         3       C   2.0
    8         2       B   1.0
    9         1       A   2.0
    10        5       A   2.0
    11        4       C   2.0
    12        6       C   3.0