python pandas dataframe dataset

How to sample random datapoints from a dataframe


I have a dataset X in a pandas DataFrame with about 48,000 data points. The dataset has a feature called gender, where 1 represents male and 0 represents female. How do I sample entries from my original dataset? Say I want a new dataset Y with 1000 random samples from X, 700 males and 300 females. I came up with this simple algorithm but can't figure out why it isn't working:

def Sample(X,maleSize,femalesize):
    DD=X
    for i in range(len(DD)):
        if (DD.race[i]==1.0)&(DD.gender.sum()==maleSize):
            DD=DD.drop(i)

        if (DD.race[i]==0.0) & ((len(DD)-DD.gender.sum())>femalesize):
            DD=DD.drop(i)
    return DD

Solution

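  • To get exactly 700 males and 300 females, sample each group separately, then concatenate and shuffle:

    import pandas as pd

    males = X[X['gender'] == 1].sample(n=700)
    females = X[X['gender'] == 0].sample(n=300)
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    ndf = pd.concat([males, females]).sample(frac=1)  # frac=1 shuffles the rows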
    

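    Or, if an approximate split is enough, sample with per-row weights. The weights apply to individual rows, so each one is divided by its group size; otherwise the resulting ratio depends on how many males and females X contains. The counts come out at roughly 70/30 and vary from draw to draw:

    n_male = (X['gender'] == 1).sum()
    n_female = len(X) - n_male
    weights = [.7 / n_male if x == 1 else .3 / n_female for x in X['gender']]
    Y = X.sample(n=1000, weights=weights)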
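
For reference, the exact-count approach above can be wrapped in a small helper and checked on synthetic data. This is only a sketch: `stratified_sample`, the `age` column, the seed values, and the randomly generated `X` are all made up for illustration and stand in for the real dataset.

```python
import numpy as np
import pandas as pd


def stratified_sample(df, column, counts, seed=None):
    """Draw exactly counts[value] rows for each value of `column`, then shuffle.

    Hypothetical helper, not part of pandas.
    """
    parts = [
        df[df[column] == value].sample(n=n, random_state=seed)
        for value, n in counts.items()
    ]
    # Concatenate the per-group samples and shuffle so the groups are interleaved
    return pd.concat(parts).sample(frac=1, random_state=seed)


# Synthetic stand-in for X: 48,000 rows with a random 0/1 gender column
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "gender": rng.integers(0, 2, size=48000),
    "age": rng.integers(18, 90, size=48000),
})

Y = stratified_sample(X, "gender", {1: 700, 0: 300}, seed=42)
print(Y["gender"].value_counts())
```

Because `sample` draws without replacement by default, each group needs at least as many rows as requested; otherwise pandas raises a `ValueError`.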