I'm using pandas and I have a dataset with about 4 million observations. I was wondering what the best / fastest / most efficient way is to select 50 random elements, or the first 50 elements, for each class (the class is just a column).
The number of unique classes in my column is about ~2k, and I would like to select a subset of 100,000 elements: 50 elements for each class.
I was thinking about grouping them by class, then iterating through each group and selecting the first 50 elements before proceeding to the next group; something like the sketch below.
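A rough sketch of what I have in mind (assuming the class column is called 'class' and the result should be one concatenated dataframe):

import pandas as pd

chunks = []
for cls, group in df.groupby('class'):  # 'class' is the column holding the ~2k labels
    chunks.append(group.head(50))       # first 50 rows of this class
result = pd.concat(chunks)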
Is there a better way to do this?
Given the following dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 2), columns=list('ab'))
df['group'] = np.remainder(np.random.permutation(len(df)), 3)  # three groups: 0, 1, 2
df.head()
a b group
0 0.069140 0.553955 1
1 0.564991 0.699645 2
2 0.251304 0.516667 2
3 0.962819 0.314219 2
4 0.353382 0.500961 0
you can get a randomized version with:

df_randomized = df.iloc[np.random.permutation(len(df))]
df_randomized.head()
a b group
90 0.734971 0.895469 0
35 0.195013 0.566211 0
27 0.370124 0.870052 2
21 0.297194 0.500713 1
66 0.319668 0.347365 2
To select N random elements, first generate the permutation and truncate it to the first N entries. Then apply it to the dataframe:
N = 10
indexes = np.random.permutation(len(df))[:N]
df_randomized = df.iloc[indexes]
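As a side note, pandas also has a built-in sample method that should give the same result in a single call (n rows drawn without replacement by default):

df_randomized = df.sample(n=N)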
To get the first N elements of each group, you can group the dataframe and apply a function that selects the first N rows. No need for any loops here, as pandas handles that for you:
N = 10
df.groupby('group')\
.apply(lambda x: x[:N][['a', 'b']])
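To get N random elements per group, as asked in the question, you can combine both ideas and sample inside the apply. A minimal sketch, using min to guard against groups with fewer than N rows:

N = 10
df.groupby('group')\
  .apply(lambda x: x.sample(min(len(x), N))[['a', 'b']])

Newer versions of pandas also provide df.groupby('group').sample(n=N) for this directly.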
All of these methods should be fast, as they rely on the optimised internals of numpy and pandas.