I have a df such as below:
For each reviewerID I want to select 2 last instances (sorted by reviewTime) as test data and the rest (up to 2 last instances) as train dataset. This is the code I have:
df = df.sort_values("reviewTime")
df_train = df.groupby('reviewerID').head(-2).reset_index(drop=True)
df_val = df.groupby('reviewerID').tail(2).reset_index(drop=True)
However, this code drops some groups (reviewerIDs) from the train data. so if I get the number of unique reviewerIDs in the original dataframe and the df_train I will have different numbers:
print(df_train['reviewerID'].nunique())
print(df_val['reviewerID'].nunique())
print(df['reviewerID'].nunique())
776775
777242
777242
Note that the test set has correct values. Now I'm guessing that in cases where my reviewerID has less than 2 rows, pandas is assigning all that to the test df and not the train df. I'm wondering how to get around this issue now, and keep the reviewerIDs with less than 2 values as part of train data instead of Validation. Considering that I don't know how many train data I have per reviewerID.
Since you have several conditions to select your data, use groupby.transform
to build a mask of the rows to select and perform boolean indexing:
g = (df.sort_values('reviewTime')
.groupby('reviewerID', sort=False)
)
# number of rows to select
N = 2
# get size of each group and enumerate rows
# select groups larger than N
# and last N rows
m1 = g.transform('size').gt(N)
m2 = g.cumcount(ascending=False).lt(N)
m = m1&m2
df_train = df[~m]
df_val = df[m]
Example:
# df
overall
reviewerID asin reviewTime
0 0 2024-01-01 0
1 2024-01-03 1
2 2024-01-05 2
3 2024-01-02 3
1 4 2024-01-01 4
5 2024-01-02 5
2 6 2024-01-02 6
7 2024-01-01 7
8 2024-01-03 8
3 9 2024-01-01 9
# df_train
overall
reviewerID asin reviewTime
0 0 2024-01-01 0
3 2024-01-02 3
1 4 2024-01-01 4
5 2024-01-02 5
2 7 2024-01-01 7
3 9 2024-01-01 9
# df_test
overall
reviewerID asin reviewTime
0 1 2024-01-03 1
2 2024-01-05 2
2 6 2024-01-02 6
8 2024-01-03 8
Intermediates:
overall size cumcount m1 m2 m
reviewerID asin reviewTime
0 0 2024-01-01 0 4 3 True False False
1 2024-01-03 1 4 1 True True True
2 2024-01-05 2 4 0 True True True
3 2024-01-02 3 4 2 True False False
1 4 2024-01-01 4 2 1 False True False
5 2024-01-02 5 2 0 False True False
2 6 2024-01-02 6 3 1 True True True
7 2024-01-01 7 3 2 True False False
8 2024-01-03 8 3 0 True True True
3 9 2024-01-01 9 1 0 False True False