
Sample with different sample sizes per customer


I have a data frame like this:

    Customer   Day
0.    A         1
1.    A         1
2.    A         1
3.    A         2
4.    B         3
5.    B         4

I want to sample from it, but with a different sample size for each customer. The size for each customer is stored in another dataframe (its Day column holds the sample size). For example:

    Customer   Day
0.    A         2
1.    B         1

Suppose I want to sample per customer per day. So far I have this function:

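import numpy as np

def sampling(frame, a):
    # Draw `a` ids from this group (np.random.choice samples with replacement by default)
    return np.random.choice(frame.Id, size=a)

grouped = frame.groupby(['Customer', 'Day'])
sampled = grouped.apply(sampling, a=??).reset_index()  # ?? = the per-customer size I can't supply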

If I set the size parameter to a global constant, it runs without problems. But I don't know how to set it when the per-customer values live in a separate dataframe.


Solution

  • You can build a mapper (customer to sample size) from df1, the dataframe holding the sample sizes, and use the mapped value as n for each group:

    mapper = df1.set_index('Customer')['Day'].to_dict()
    
    df.groupby('Customer', as_index=False).apply(lambda x: x.sample(n = mapper[x.name]))
    
    
           Customer Day
    0   3   A       2
        2   A       1
    1   4   B       3
    

    This returns a MultiIndex; you can flatten it with reset_index:

    df.groupby('Customer').apply(lambda x: x.sample(n = mapper[x.name])).reset_index(drop = True)

        Customer    Day
    0   A           1
    1   A           1
    2   B           3
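
For reference, here is a minimal self-contained sketch of the same mapper idea, written as a loop over the groups instead of groupby.apply. The frames are rebuilt from the question; the names pieces, sampled and per_day, the min(...) cap and random_state=0 are illustrative assumptions, not part of the original answer. The second part is one possible reading of "per customer per day": each (Customer, Day) group is sampled with that customer's size, with replacement, since np.random.choice in the question also samples with replacement by default.

```python
import pandas as pd

# Data from the question; df1 stores each customer's sample size in its 'Day' column.
df = pd.DataFrame({
    "Customer": ["A", "A", "A", "A", "B", "B"],
    "Day": [1, 1, 1, 2, 3, 4],
})
df1 = pd.DataFrame({"Customer": ["A", "B"], "Day": [2, 1]})

# Customer -> sample size lookup, as in the answer.
mapper = df1.set_index("Customer")["Day"].to_dict()

# Per customer: sample each customer's rows separately, then stitch them together.
# min(...) caps the request at the group size (an added safeguard, not in the answer).
# random_state=0 is an arbitrary seed used only to make the run repeatable.
pieces = [
    group.sample(n=min(mapper[customer], len(group)), random_state=0)
    for customer, group in df.groupby("Customer")
]
sampled = pd.concat(pieces).reset_index(drop=True)
print(sampled)

# Per customer per day (assumed interpretation): every (Customer, Day) group
# gets that customer's sample size; replace=True allows n > group size.
per_day_pieces = [
    group.sample(n=mapper[customer], replace=True, random_state=0)
    for (customer, day), group in df.groupby(["Customer", "Day"])
]
per_day = pd.concat(per_day_pieces).reset_index(drop=True)
print(per_day)
```

If a customer is missing from df1, mapper[customer] raises a KeyError; mapper.get(customer, 0) would skip such customers instead, if that is the behaviour you want.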