I have two dataframes. One holds session ids and a cut-off point for each session. The other has multiple rows per session, and I want to take the first n rows of each session, where n is that session's cut-off point from the first dataframe.
For example, session 0 has 20 rows and session 1 has 50 rows. The cut-off index is 10 for session 0 and 30 for session 1. I want a groupby or other vectorized operation that takes the first 10 rows of session 0 and the first 30 rows of session 1.
Is it possible without looping?
An example:
import numpy as np
import pandas as pd
# Sample data:
df = pd.DataFrame({
    "session": np.repeat(np.arange(5), 4),
    "data": np.arange(20)
})
# Define the cutoff for each session. A list works here because the
# session ids are 0..4 and can be used as list positions; for arbitrary
# session ids, use a dict (session -> cutoff) instead.
cutoffs = [3, 2, 4, 2, 1]
# x.name is the key (session id) of the group currently being processed
out = df.groupby("session").apply(lambda x: x.head(cutoffs[x.name]))
out:
            session  data
session
0       0         0     0
        1         0     1
        2         0     2
1       4         1     4
        5         1     5
2       8         2     8
        9         2     9
        10        2    10
        11        2    11
3       12        3    12
        13        3    13
4       16        4    16
The second level of the index is the original row index; you can optionally drop it with out.droplevel(1).
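Since the question asks for something without a per-group Python function, here is a minimal sketch of a fully vectorized variant. It assumes the cutoffs live in a second dataframe, as in the question; the name cutoff_df and its column names "session" and "cutoff" are made up for illustration. The idea is to number the rows within each session with cumcount() and keep only those whose position is below the session's cutoff.

```python
import numpy as np
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    "session": np.repeat(np.arange(5), 4),
    "data": np.arange(20),
})

# Hypothetical cutoff dataframe: one row per session (column names are assumptions)
cutoff_df = pd.DataFrame({
    "session": [0, 1, 2, 3, 4],
    "cutoff": [3, 2, 4, 2, 1],
})

# Position of each row within its session: 0, 1, 2, ...
pos = df.groupby("session").cumcount()

# Look up each row's cutoff by its session id
limit = df["session"].map(cutoff_df.set_index("session")["cutoff"])

# Keep rows whose within-session position is below the cutoff
out_vec = df[pos < limit]
```

This keeps the original flat index, so no droplevel is needed. A session that is missing from cutoff_df gets NaN as its limit, and the comparison is then False, so all of its rows are dropped.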