R subset stratify dataframe by group; subset the max amount of observations per group as long as true and false boolean are balanced: (python answer is also accepted)
I have a dataset of 10000 samples from 600 restaurant IDs, with some of the IDs missing and a biased boolean that I need to balance to 50:50 before I run any models. To recreate the dataset, here is the code
x<-floor(runif(10000, 0, 600)) #make a dataset of 10000 samples from 600 restaurant IDs
x<-sort(x)
y<-sample(0:1,10000,prob=c(.16,.84),replace=TRUE) #make a biased boolean for those 10000 samples
df = data.frame(x,y) #dataframe has random number of restaurants and biased boolean
colnames(df) <- c("Restaurant_ID","Restaurant_Bool")
summary(df)
nrow(df)
z<-floor(runif(10, 0, 600)) #create a 10 restaurants by ID that are missing from the dataset
for (i in 10) {
df<-df[!(df$Restaurant_ID==z[i]),] #remove those restaurants by ID from the dataset
}
summary(df)
nrow(df)
The ratio for true:false for the dataset is about 84:16, but that number also varies per restaurant ID
Similar to stratification by restaurant ID, I need to limit the amount of true observations to be equal to the number of false observations per restaurant ID
I have no idea how to code this and anything helps
so for example, for restaurant_ID 0, there could be 10 observations, where 8 has true and 2 has false. There is no restaurant_ID 1.
for restaurant_ID 2, there could be 8 observations, where 3 has true and 5 has false.
X restaurant_ID Restaurant_Bool
1 0 1
2 0 1
3 0 1
4 0 0
5 0 1
6 0 1
7 0 1
8 0 0
9 0 1
10 0 1
11 2 0
12 2 0
13 2 1
14 2 0
15 2 1
16 2 0
17 2 1
18 2 0
...
I want a subset result where the number of Restaurant_Bool == 0 is the same as the number of Restaurant_Bool == 1 as long as the maximum number of observations are subsetted based based on the minimum number of boolean observation per restaurant_ID
X restaurant_ID Restaurant_Bool
1 0 1
2 0 1
4 0 0
8 0 0
11 2 0
12 2 0
13 2 1
15 2 1
16 2 0
17 2 1
...
This could be the first subset, and another subset could use the other observations to randomly recreate another subset with the same rule:
X restaurant_ID Restaurant_Bool
6 0 1
7 0 1
4 0 0
8 0 0
14 2 0
18 2 0
13 2 1
15 2 1
16 2 0
17 2 1
...
...and so on, where multiple different subsets from the same data set could be created by keeping the same sample number of Restaurant_Bool == 1 as Restaurant_Bool == 0 per restaurant_ID
In the rare chance that Restaurant_Bool == 0 has more observations than Restaurant_Bool == 1, then use the least represented boolean to recreate the dataset per restaurant ID, to where a whole restaurant ID could be removed from the dataset if either true or false has no observations
The reason that I want to stratify by restaurant_ID is that there could be some internal correlation with the rest of the columns that I need to preserve when making my model
The closest answer I found is this Subset panel data by group , but does not take into effect that I want to keep the max amount of observations per restaurant_ID as long as true and false boolean are balanced
In python, the code looks like this
Make a new empty dataset and write a for loop that groups by restaurant_id and find the minimum amount of n per subgroup Restaurant_Bool
create a catch where if n is 0, then go to the next restaurant_id
merge recommended and not recommended into a temporary group_reviews dataframe and append group_reviews reviews into the balanced_reviews dataframe while asserting that the mean of Restaurant_Bool is 0.5
after the loop is over per group, assert that the mean of Restaurant_Bool is 0.5 for the whole dataframe balanced_reviews
balanced_reviews = pd.DataFrame()
for restaurant_id, group in reviews.groupby('restaurant_id'):
take_n = min((group['Restaurant_Bool'] == 0).sum(), (group['Restaurant_Bool'] == 1).sum())
if take_n == 0:
continue
reg_reviews = group[group['Restaurant_Bool'] == 1].sample(n=take_n, random_state=0)
not_reviews = group[group['Restaurant_Bool'] == 0].sample(n=take_n, random_state=0)
group_reviews = reg_reviews.append(not_reviews)
assert group_reviews['Restaurant_Bool'].mean() == .5
balanced_reviews = balanced_reviews.append(group_reviews)
assert balanced_reviews['Restaurant_Bool'].mean() == .5