Pandas groupby.head(-n) drops some groups

I have a df such as below:

For each reviewerID I want to select 2 last instances (sorted by reviewTime) as test data and the rest (up to 2 last instances) as train dataset. This is the code I have:

df = df.sort_values("reviewTime")
df_train = df.groupby('reviewerID').head(-2).reset_index(drop=True)
df_val = df.groupby('reviewerID').tail(2).reset_index(drop=True)

However, this code drops some groups (reviewerIDs) from the train data. so if I get the number of unique reviewerIDs in the original dataframe and the df_train I will have different numbers:

print(df_train['reviewerID'].nunique())
print(df_val['reviewerID'].nunique())
print(df['reviewerID'].nunique())

776775
777242
777242

Note that the test set has correct values. Now I'm guessing that in cases where my reviewerID has less than 2 rows, pandas is assigning all that to the test df and not the train df. I'm wondering how to get around this issue now, and keep the reviewerIDs with less than 2 values as part of train data instead of Validation. Considering that I don't know how many train data I have per reviewerID.

Solution

Since you have several conditions to select your data, use groupby.transform to build a mask of the rows to select and perform boolean indexing:

g = (df.sort_values('reviewTime')
       .groupby('reviewerID', sort=False)
     )

# number of rows to select
N = 2

# get size of each group and enumerate rows
# select groups larger than N
# and last N rows
m1 = g.transform('size').gt(N)
m2 = g.cumcount(ascending=False).lt(N)
m = m1&m2

df_train = df[~m]
df_val = df[m]

Example:

# df
                            overall
reviewerID asin reviewTime         
0          0    2024-01-01        0
           1    2024-01-03        1
           2    2024-01-05        2
           3    2024-01-02        3
1          4    2024-01-01        4
           5    2024-01-02        5
2          6    2024-01-02        6
           7    2024-01-01        7
           8    2024-01-03        8
3          9    2024-01-01        9

# df_train
                            overall
reviewerID asin reviewTime         
0          0    2024-01-01        0
           3    2024-01-02        3
1          4    2024-01-01        4
           5    2024-01-02        5
2          7    2024-01-01        7
3          9    2024-01-01        9

# df_test
                            overall
reviewerID asin reviewTime         
0          1    2024-01-03        1
           2    2024-01-05        2
2          6    2024-01-02        6
           8    2024-01-03        8

Intermediates:

                            overall  size  cumcount     m1     m2      m
reviewerID asin reviewTime                                              
0          0    2024-01-01        0     4         3   True  False  False
           1    2024-01-03        1     4         1   True   True   True
           2    2024-01-05        2     4         0   True   True   True
           3    2024-01-02        3     4         2   True  False  False
1          4    2024-01-01        4     2         1  False   True  False
           5    2024-01-02        5     2         0  False   True  False
2          6    2024-01-02        6     3         1   True   True   True
           7    2024-01-01        7     3         2   True  False  False
           8    2024-01-03        8     3         0   True   True   True
3          9    2024-01-01        9     1         0  False   True  False