python, pandas, numpy, group-by

pandas - DataFrame.groupby.head with different values


I have two dataframes. One holds session ids and their cut-off points. The other has multiple rows per session, and I want to take the first n rows of each session, where n is that session's cut-off point from the first dataframe. This is a screenshot of the two dataframes.

[image: screenshot of the two dataframes]

For example, session 0 has 20 rows and session 1 has 50 rows. The cut-off for session 0 is 10 and the cut-off for session 1 is 30. I want a groupby, or any other vectorized operation, that takes the first 10 rows of session 0 and the first 30 rows of session 1.

Is it possible without looping?


Solution

  • An example:

    import numpy as np
    import pandas as pd
    
    # Sample data:
    df = pd.DataFrame({
        "session": np.repeat(np.arange(5), 4),
        "data": np.arange(20)
    })
    
    # Define the cutoffs for each session:
    cutoffs = [3, 2, 4, 2, 1]
    # Or use a dict: session -> cutoff
    
    # x.name is the key (session id) of the group being processed,
    # so each group keeps its own number of leading rows.
    out = df.groupby("session").apply(lambda x: x.head(cutoffs[x.name]))
    

    out:

                session  data
    session
    0       0         0     0
            1         0     1
            2         0     2
    1       4         1     4
            5         1     5
    2       8         2     8
            9         2     9
            10        2    10
            11        2    11
    3       12        3    12
            13        3    13
    4       16        4    16
    

    The second level of the index is the original row index; you can optionally drop it using .droplevel(1). To keep the original row index and drop the session level instead, use .droplevel(0).
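
Since the question asks for something without looping, here is a sketch of an alternative that avoids apply (which calls the lambda once per group). It numbers the rows within each session with groupby(...).cumcount() and keeps the rows whose position is below that session's cut-off. The cutoff_df table and its column names are assumptions standing in for the second dataframe from the question, which is only visible in the screenshot.

```python
import numpy as np
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    "session": np.repeat(np.arange(5), 4),
    "data": np.arange(20),
})

# Hypothetical cut-off table: one row per session (column names are assumed).
cutoff_df = pd.DataFrame({
    "session": np.arange(5),
    "cutoff": [3, 2, 4, 2, 1],
})

# Position of each row within its session: 0, 1, 2, ...
pos = df.groupby("session").cumcount()

# Look up the cut-off that applies to each row's session.
limit = df["session"].map(cutoff_df.set_index("session")["cutoff"])

# Keep only the rows that fall before their session's cut-off.
out = df[pos < limit]
print(out)
```

This keeps the original flat index and the same rows as the apply version. A session missing from cutoff_df would get NaN as its limit, so all of its rows would be dropped; whether that is the desired behaviour depends on the data.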