I have a dataset X in a pandas DataFrame with about 48,000 data points. The dataset has a feature called gender, with 1 representing male and 0 representing female. How do I sample entries from my original dataset? Say I want a new dataset Y with 1000 random samples from X, 700 males and 300 females. I came up with this simple algorithm but can't figure out why it isn't working:
def Sample(X,maleSize,femalesize):
    DD=X
    for i in range(len(DD)):
        if (DD.race[i]==1.0)&(DD.gender.sum()==maleSize):
            DD=DD.drop(i)
        if (DD.race[i]==0.0) & ((len(DD)-DD.gender.sum())>femalesize):
            DD=DD.drop(i)
    return DD
Use:
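import pandas as pd

males = X[X['gender']==1].sample(n=700)
females = X[X['gender']==0].sample(n=300)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
ndf = pd.concat([males, females]).sample(frac=1)  # frac=1 shuffles the rows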
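If you need this more than once, the same idea can be wrapped in a small helper. This is only a sketch: the function name `sample_by_gender`, the `random_state` argument, and the toy DataFrame are my own additions for illustration, not something from your data.

```python
import pandas as pd

def sample_by_gender(X, n_male, n_female, random_state=None):
    """Return exactly n_male rows with gender == 1 and n_female rows with gender == 0."""
    # Sample each group separately, without replacement
    males = X[X["gender"] == 1].sample(n=n_male, random_state=random_state)
    females = X[X["gender"] == 0].sample(n=n_female, random_state=random_state)
    # Combine the two groups and shuffle so they are not stacked one after the other
    return pd.concat([males, females]).sample(frac=1, random_state=random_state)

# Toy stand-in for X (hypothetical data): 2400 males and 2400 females
X = pd.DataFrame({"gender": [1, 0] * 2400, "value": range(4800)})
Y = sample_by_gender(X, 700, 300, random_state=0)
```

Note that `sample` raises a `ValueError` if a group has fewer rows than requested, since sampling is without replacement by default.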
Or, if an approximate 70/30 split is enough (the counts are random rather than exactly 700 and 300):
weights = [.7 if x==1 else .3 for x in X['gender']]
X.sample(n=1000, weights = weights)
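One caveat with fixed per-row weights: pandas normalizes the weights over all rows, so .7/.3 only gives about 70% males when X has roughly equal numbers of each gender. A possible adjustment, sketched below on made-up data, is to divide each target share by the size of its group. The `target` dict and the toy DataFrame are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for X (hypothetical data) with unequal group sizes
X = pd.DataFrame({"gender": [1] * 3000 + [0] * 1000})

target = {1: 0.7, 0: 0.3}  # desired share of each gender in the sample

# Number of rows in each row's own gender group
group_size = X["gender"].map(X["gender"].value_counts())

# Per-row weight = target share / group size, so each group's total weight equals its share
weights = X["gender"].map(target) / group_size

Y = X.sample(n=1000, weights=weights, random_state=0)
male_share = (Y["gender"] == 1).mean()
```

The split is still only approximate, because each draw is random and sampling is without replacement. For exact counts, sample each group separately as in the first approach.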