python pandas bin downsampling

Conditional downsampling over a data frame


I am working on a data frame that looks like this:

Id  feat1 value
c1   c22     51
c2   c12     83
c3   d31     42
c4   a19     110
c5   d44     56
.     .       .
.     .       .
.     .       .

The value column lies in the range [40, 240]. I want to downsample the data frame so that I get 300 rows for each of the following bins: [40-50, 50-60, 60-70, 70-80, 80-90, 90-100, 100-110, ...]


Solution

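  • You can create the bins with pandas.cut(), then group by bin and draw the same number of rows from each group. By default pd.cut() builds right-closed intervals such as (40, 50], so include_lowest=True is needed for a value of exactly 40 to land in the first bin. Note that x.sample(300) raises a ValueError if any bin holds fewer than 300 rows.

    import pandas as pd
    # Bin edges 40, 50, ..., 240; include_lowest=True keeps value == 40 in the first bin
    df['bin'] = pd.cut(df['value'], range(40, 250, 10), include_lowest=True)
    sampled_df = df.groupby('bin', observed=True).apply(lambda x: x.sample(300)).reset_index(drop=True)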
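
If some bins might hold fewer than 300 rows, one option is to cap the sample size per bin instead of requesting exactly 300. The following is a minimal sketch and not part of the original answer: the helper name downsample_per_bin and its parameters (column, low, high, width, n_per_bin, seed) are made up for illustration, and it assumes the bin edges 40, 50, ..., 240 from the question.

```python
import numpy as np
import pandas as pd


def downsample_per_bin(df, column="value", low=40, high=240, width=10,
                       n_per_bin=300, seed=0):
    """Return at most n_per_bin randomly chosen rows from each bin of `column`.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    edges = np.arange(low, high + width, width)  # 40, 50, ..., 240
    # Right-closed bins; include_lowest=True puts value == low in the first bin
    bins = pd.cut(df[column], edges, include_lowest=True)

    parts = []
    # observed=True skips empty bins; rows outside [low, high] have a NaN bin
    # and are left out of every group by groupby's default dropna=True
    for _, group in df.groupby(bins, observed=True):
        n = min(len(group), n_per_bin)  # never ask for more rows than exist
        parts.append(group.sample(n=n, random_state=seed))

    if not parts:  # nothing fell inside any bin
        return df.iloc[0:0]
    return pd.concat(parts, ignore_index=True)
```

Calling downsample_per_bin(df) on the data frame from the question would return up to 300 rows per bin. Rows whose value falls outside [40, 240] are dropped, and the fixed seed only serves to make the draw reproducible.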