Search code examples
pythonpandasdataframegroup-by

Pandas groupby.head(-n) drops some groups


I have a df such as below:

dataframe

For each reviewerID I want to select 2 last instances (sorted by reviewTime) as test data and the rest (up to 2 last instances) as train dataset. This is the code I have:

df = df.sort_values("reviewTime")
df_train = df.groupby('reviewerID').head(-2).reset_index(drop=True)
df_val = df.groupby('reviewerID').tail(2).reset_index(drop=True)

However, this code drops some groups (reviewerIDs) from the train data. so if I get the number of unique reviewerIDs in the original dataframe and the df_train I will have different numbers:

print(df_train['reviewerID'].nunique())
print(df_val['reviewerID'].nunique())
print(df['reviewerID'].nunique())
776775
777242
777242

Note that the test set has correct values. Now I'm guessing that in cases where my reviewerID has less than 2 rows, pandas is assigning all that to the test df and not the train df. I'm wondering how to get around this issue now, and keep the reviewerIDs with less than 2 values as part of train data instead of Validation. Considering that I don't know how many train data I have per reviewerID.


Solution

  • Since you have several conditions to select your data, use groupby.transform to build a mask of the rows to select and perform boolean indexing:

    g = (df.sort_values('reviewTime')
           .groupby('reviewerID', sort=False)
         )
    
    # number of rows to select
    N = 2
    
    # get size of each group and enumerate rows
    # select groups larger than N
    # and last N rows
    m1 = g.transform('size').gt(N)
    m2 = g.cumcount(ascending=False).lt(N)
    m = m1&m2
    
    df_train = df[~m]
    df_val = df[m]
    

    Example:

    # df
                                overall
    reviewerID asin reviewTime         
    0          0    2024-01-01        0
               1    2024-01-03        1
               2    2024-01-05        2
               3    2024-01-02        3
    1          4    2024-01-01        4
               5    2024-01-02        5
    2          6    2024-01-02        6
               7    2024-01-01        7
               8    2024-01-03        8
    3          9    2024-01-01        9
    
    # df_train
                                overall
    reviewerID asin reviewTime         
    0          0    2024-01-01        0
               3    2024-01-02        3
    1          4    2024-01-01        4
               5    2024-01-02        5
    2          7    2024-01-01        7
    3          9    2024-01-01        9
    
    # df_test
                                overall
    reviewerID asin reviewTime         
    0          1    2024-01-03        1
               2    2024-01-05        2
    2          6    2024-01-02        6
               8    2024-01-03        8
    

    Intermediates:

                                overall  size  cumcount     m1     m2      m
    reviewerID asin reviewTime                                              
    0          0    2024-01-01        0     4         3   True  False  False
               1    2024-01-03        1     4         1   True   True   True
               2    2024-01-05        2     4         0   True   True   True
               3    2024-01-02        3     4         2   True  False  False
    1          4    2024-01-01        4     2         1  False   True  False
               5    2024-01-02        5     2         0  False   True  False
    2          6    2024-01-02        6     3         1   True   True   True
               7    2024-01-01        7     3         2   True  False  False
               8    2024-01-03        8     3         0   True   True   True
    3          9    2024-01-01        9     1         0  False   True  False