
Selecting n-elements of each class


I'm using pandas and have a dataset with about 4 million observations. What is the fastest, most efficient way to select either 50 random elements or the first 50 elements for each class (the class is just a column)?

The column has about 2k unique classes, and I would like to select a subset of 100,000 elements, 50 for each class.

I was thinking about grouping the rows by class, then iterating through the groups and selecting the first 50 elements of each one in turn.

Is there a better way to do this?


Solution

  • Given the following dataframe

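    import numpy as np
    import pandas as pd

    # 100 rows of random values, each row assigned to one of 3 groups
    df = pd.DataFrame(np.random.rand(100, 2), columns=list('ab'))
    df['group'] = np.remainder(np.random.permutation(len(df)), 3)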
    
    df.head()
    
        a           b           group
    0   0.069140    0.553955    1
    1   0.564991    0.699645    2
    2   0.251304    0.516667    2
    3   0.962819    0.314219    2
    4   0.353382    0.500961    0
    

    you can get a row-shuffled version with

    # .ix was removed in pandas 1.0; .iloc selects rows by position
    df_randomized = df.iloc[np.random.permutation(len(df))]
    
    df_randomized.head()
    
        a           b           group
    90  0.734971    0.895469    0
    35  0.195013    0.566211    0
    27  0.370124    0.870052    2
    21  0.297194    0.500713    1
    66  0.319668    0.347365    2
    

    To select N random elements, generate the permutation, keep only its first N entries, and use those to index the dataframe:

    N = 10
    indexes = np.random.permutation(len(df))[:N]
    df_randomized = df.iloc[indexes]  # .ix was removed in pandas 1.0
    

    To get the first N elements of each group, group the dataframe and apply a function that selects the first N rows. There is no need for an explicit loop, as pandas handles the iteration for you:

    N = 10
    # for each group, keep the first N rows and only columns 'a' and 'b'
    df.groupby('group')\
        .apply(lambda x: x[:N][['a', 'b']])
    

    All of these methods should be fast, as they rely on the optimised internals of either numpy or pandas.
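
The two steps above can also be combined to get N random rows per class, which is what the question asks for. The following is a minimal sketch rather than a definitive implementation: it shuffles the rows once with `DataFrame.sample(frac=1)` and then takes the first N rows of each group with `groupby(...).head(N)`. The toy dataframe, the `random_state` seed and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataframe: 100 rows, 3 classes
# (class sizes are 34, 33 and 33).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 2)), columns=list("ab"))
df["group"] = np.remainder(rng.permutation(len(df)), 3)

N = 10  # rows to keep per class (50 in the question)

# Shuffle all rows once, then keep the first N rows of each group.
# Since the rows are already in random order, this yields N random rows per class.
shuffled = df.sample(frac=1, random_state=0)
subset = shuffled.groupby("group").head(N)

# Number of rows kept for each class.
rows_per_class = subset["group"].value_counts()
```

`head(N)` keeps the original index and simply returns every row of a class that has fewer than N rows. In pandas 1.1 and later, `df.groupby("group").sample(n=N)` is a more direct alternative, but it raises an error when a class has fewer than N rows unless `replace=True` is passed.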