python pandas dataframe random sample

Python Pandas - Sample certain number of individuals from binned data


Here is a dummy example of the DataFrame I'm working with. It effectively comprises binned data: the first column gives a category and the second column gives the number of individuals in that category.

import pandas as pd

df = pd.DataFrame(data={'Category': ['A','B','C','D','E','F','G','H','I'],
                        'Count': [1000,200,850,350,4000,20,35,4585,2]})

I want to take a random sample, say of 100 individuals, from these data. So for example my random sample could be:

sample1 = pd.DataFrame(data={'Category': ['A','B','C','D','E','F','G','H','I'],
                             'Count': [15,2,4,4,35,0,15,25,0]})

I.e. the sample cannot contain more individuals than are actually in any of the categories. Sampling 0 individuals from a category is possible (and more likely for categories with a lower Count).

How could I go about doing this? I feel like there must be a simple answer but I can't think of it!

Thank you in advance!


Solution

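  • You can try sample with replacement, weighting each row by its Count:

    df.sample(n=100, replace=True, weights=df.Count).groupby(by='Category').count()

    Note that because this draws with replacement, a small category can in principle be sampled more times than its Count, and categories that are never drawn are missing from the result rather than shown as 0.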
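
If the sample must never exceed the number of individuals actually present in a category, one option is to draw without replacement instead. Below is a minimal sketch, not the original answer's method. It assumes NumPy 1.18 or later (for Generator.multivariate_hypergeometric); the seed and the names rng, drawn, sample, and expanded are only illustrative. The second approach expands the bins to one row per individual, which is simple but uses memory proportional to the total Count.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
    'Count': [1000, 200, 850, 350, 4000, 20, 35, 4585, 2],
})

# Arbitrary seed, fixed only so the sketch is reproducible.
rng = np.random.default_rng(0)

# Approach 1: draw 100 individuals without replacement directly from the
# bin counts. The result has one entry per category, including zeros.
drawn = rng.multivariate_hypergeometric(df['Count'].to_numpy(), 100)
sample = pd.DataFrame({'Category': df['Category'], 'Count': drawn})

# Approach 2: repeat each category label Count times (one row per
# individual), sample 100 rows without replacement, then count per category.
individuals = df.loc[df.index.repeat(df['Count']), 'Category']
expanded = (
    individuals.sample(n=100, random_state=0)
    .value_counts()
    .reindex(df['Category'], fill_value=0)  # restore categories drawn 0 times
)
```

Both results sum to 100 and keep every category in the output, so a category that happens not to be drawn shows up with a count of 0.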