python pandas sample keyerror

Sample Pandas based on dictionary


I'm trying to sample a pandas DataFrame based on a dictionary and a specific column. For each value of the y column, I know exactly how many observations I want to pick.

I can do this with a groupby/apply combination like this:

import pandas as pd

df = pd.DataFrame({'y': [2,2,0,0,0,1,1,1,1,1], 'x': 1, 'z': 2})

    y   x   z
0   2   1   2
1   2   1   2
2   0   1   2
3   0   1   2
4   0   1   2
5   1   1   2

sizes = {0: 2, 1: 1, 2:1}

df.groupby('y').apply(lambda x: x.sample(sizes[x['y'].values[0]]))

     y  x  z
y           
0 2  0  1  2
  4  0  1  2
1 5  1  1  2
2 0  2  1  2

However, if I use unique instead of values (which should be equivalent), I get a weird KeyError: 'y' error on the DataFrame:

df.groupby('y').apply(lambda x: x.sample(sizes[x.y.unique()[0]]))

Can someone explain why this is happening?

EDIT:

This happened on 0.23.1 but not on 0.23.1 so this was probably a bug.


Solution

  • I think you need the .name attribute:

    df1 = df.groupby('y').apply(lambda x: x.sample(sizes[x.name]))
    print (df1)
    
         y  x  z
    y           
    0 4  0  1  2
      2  0  1  2
    1 6  1  1  2
    2 0  2  1  2
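To see what .name holds inside apply, the following minimal sketch returns it from each group. It reuses the df from the question; group_labels is a name introduced here for illustration only. Depending on the pandas version, apply may emit a deprecation warning about grouping columns, which does not affect the result.

```python
import pandas as pd

df = pd.DataFrame({'y': [2, 2, 0, 0, 0, 1, 1, 1, 1, 1], 'x': 1, 'z': 2})

# Inside apply, each sub-frame carries the group label in its .name attribute.
# Returning it shows which key would be used for the dictionary lookup.
group_labels = df.groupby('y').apply(lambda g: g.name)

print(group_labels.tolist())
```

Each label is exactly the value of y for that group, so it can be used directly as a key into sizes without reading the y column.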
    

    If some values of y might be missing from the dictionary, use get with a default of 0 for the unmatched values:

    df1 = df.groupby('y').apply(lambda x: x.sample(sizes.get(x.name, 0)))
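The same idea can also be written as an explicit loop over the groups, which avoids any version-specific behaviour of apply. This is only a sketch under assumptions: the sizes dictionary deliberately has no entry for y == 2, random_state=0 is added just to make the draw repeatable, and the names parts and sampled are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'y': [2, 2, 0, 0, 0, 1, 1, 1, 1, 1], 'x': 1, 'z': 2})

# Assumption for illustration: no entry for y == 2, so that group gets 0 rows.
sizes = {0: 2, 1: 1}

# Iterating over a groupby yields (group label, sub-frame) pairs.
parts = [
    group.sample(n=sizes.get(key, 0), random_state=0)
    for key, group in df.groupby('y')
]

# Stitch the per-group samples back into a single DataFrame.
sampled = pd.concat(parts)

print(sampled['y'].value_counts().to_dict())
```

Groups without an entry in sizes contribute an empty sample, so they simply drop out of the result instead of raising a KeyError.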
    

    EDIT:

    The problem is that unique returns a one-element NumPy array:

    def f(x):
        print (x['y'].unique())
        print (x['y'].unique()[0])
        print (sizes[x['y'].unique()[0]])
        print (x.sample(sizes[x['y'].unique()[0]]))
    
    df1 = df.groupby('y').apply(f)
    
    [0]
    0
    2
       y  x  z
    2  0  1  2
    4  0  1  2
    [0]
    0
    2
       y  x  z
    4  0  1  2
    2  0  1  2
    [1]
    1
    1
       y  x  z
    6  1  1  2
    [2]
    2
    1
       y  x  z
    0  2  1  2
    

    df1 = df.groupby('y').apply(lambda x: x.sample(sizes[x.y.unique()[0]]))
    print (df1)
         y  x  z
    y           
    0 4  0  1  2
      2  0  1  2
    1 6  1  1  2
    2 0  2  1  2
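As a small check of why indexing the result of unique() still works as a dictionary key, the sketch below looks at the types involved outside of groupby. The Series s is made up for illustration; the point is only that the first element of the array is a NumPy integer scalar, which hashes and compares like the equivalent Python int.

```python
import numpy as np
import pandas as pd

sizes = {0: 2, 1: 1, 2: 1}

# Hypothetical stand-in for the y column of a single group.
s = pd.Series([0, 0, 0], name='y')

u = s.unique()   # one-element NumPy array
first = u[0]     # NumPy integer scalar, not a Python int

print(type(u).__name__, type(first).__name__)

# Both lookups resolve to the same dictionary entry.
print(sizes[first], sizes[s.values[0]])
```

So, as far as the dictionary lookup is concerned, unique()[0] and values[0] behave the same way, which is consistent with the question's note that the KeyError depended on the pandas version.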