Search code examples
pythonrbooleansubsetpanel

r subset stratify dataframe by group; subset the max amount of observations per group as long as true and false boolean are balanced


R subset stratify dataframe by group; subset the max amount of observations per group as long as true and false boolean are balanced: (python answer is also accepted)

I have a dataset of 10000 samples from 600 restaurant IDs, with some of the IDs missing and a biased boolean that I need to balance to 50:50 before I run any models. To recreate the dataset, here is the code

x<-floor(runif(10000, 0, 600)) #make a dataset of 10000 samples from 600 restaurant IDs
x<-sort(x)
y<-sample(0:1,10000,prob=c(.16,.84),replace=TRUE) #make a biased boolean for those 10000 samples
df = data.frame(x,y) #dataframe has random number of restaurants and biased boolean
colnames(df) <- c("Restaurant_ID","Restaurant_Bool")
summary(df)
nrow(df)

z<-floor(runif(10, 0, 600)) #create a 10 restaurants by ID that are missing from the dataset
for (i in 10) {
  df<-df[!(df$Restaurant_ID==z[i]),] #remove those restaurants by ID from the dataset
}
summary(df)
nrow(df)

The ratio for true:false for the dataset is about 84:16, but that number also varies per restaurant ID

Similar to stratification by restaurant ID, I need to limit the amount of true observations to be equal to the number of false observations per restaurant ID

I have no idea how to code this and anything helps

so for example, for restaurant_ID 0, there could be 10 observations, where 8 has true and 2 has false. There is no restaurant_ID 1.

for restaurant_ID 2, there could be 8 observations, where 3 has true and 5 has false.

    X restaurant_ID Restaurant_Bool
    1 0             1
    2 0             1
    3 0             1
    4 0             0
    5 0             1
    6 0             1
    7 0             1
    8 0             0
    9 0             1
   10 0             1
   11 2             0
   12 2             0
   13 2             1
   14 2             0
   15 2             1
   16 2             0
   17 2             1
   18 2             0
   ...

I want a subset result where the number of Restaurant_Bool == 0 is the same as the number of Restaurant_Bool == 1 as long as the maximum number of observations are subsetted based based on the minimum number of boolean observation per restaurant_ID

 X restaurant_ID Restaurant_Bool
 1 0             1
 2 0             1
 4 0             0
 8 0             0
11 2             0
12 2             0
13 2             1
15 2             1
16 2             0
17 2             1
...

This could be the first subset, and another subset could use the other observations to randomly recreate another subset with the same rule:

 X restaurant_ID Restaurant_Bool
 6 0             1
 7 0             1
 4 0             0
 8 0             0
14 2             0
18 2             0
13 2             1
15 2             1
16 2             0
17 2             1
...

...and so on, where multiple different subsets from the same data set could be created by keeping the same sample number of Restaurant_Bool == 1 as Restaurant_Bool == 0 per restaurant_ID

In the rare chance that Restaurant_Bool == 0 has more observations than Restaurant_Bool == 1, then use the least represented boolean to recreate the dataset per restaurant ID, to where a whole restaurant ID could be removed from the dataset if either true or false has no observations

The reason that I want to stratify by restaurant_ID is that there could be some internal correlation with the rest of the columns that I need to preserve when making my model

The closest answer I found is this Subset panel data by group , but does not take into effect that I want to keep the max amount of observations per restaurant_ID as long as true and false boolean are balanced


Solution

  • In python, the code looks like this

    Make a new empty dataset and write a for loop that groups by restaurant_id and find the minimum amount of n per subgroup Restaurant_Bool

    create a catch where if n is 0, then go to the next restaurant_id

    merge recommended and not recommended into a temporary group_reviews dataframe and append group_reviews reviews into the balanced_reviews dataframe while asserting that the mean of Restaurant_Bool is 0.5

    after the loop is over per group, assert that the mean of Restaurant_Bool is 0.5 for the whole dataframe balanced_reviews

    balanced_reviews = pd.DataFrame()
    for restaurant_id, group in reviews.groupby('restaurant_id'):
        take_n = min((group['Restaurant_Bool'] == 0).sum(), (group['Restaurant_Bool'] == 1).sum())
        if take_n == 0:
            continue
        reg_reviews = group[group['Restaurant_Bool'] == 1].sample(n=take_n, random_state=0)
        not_reviews = group[group['Restaurant_Bool'] == 0].sample(n=take_n, random_state=0)
        group_reviews = reg_reviews.append(not_reviews)
    
        assert group_reviews['Restaurant_Bool'].mean() == .5
        balanced_reviews = balanced_reviews.append(group_reviews)
    
    assert balanced_reviews['Restaurant_Bool'].mean() == .5