Similar to this post, I am removing rows with >50% missing data to get a more reliable and complete dataset
# Keep only the rows with at least x% non-NA values
# calculate threshold
numOfFeatures=38 # num of features in dataset
x=round(numOfFeatures*0.5) #50%
dfWithDroppedRows = df.dropna(thresh=x)
However, I have a imbalanced dataset, where the majority class makes up almost 93% of my dataset
df['y'].value_counts(normalize=True) * 100
No 92.769441
Yes 7.230559
Therefore, I want to edit the above code to remove rows with >50% missing data from sample from the majority class only, so I do not lose samples from the minority class.
To do this, I have tried:
dfWithDroppedRows = df[df['y'] == 'No'].dropna(thresh=x)
But this results in dfWithDroppedRows containing only reduced rows from majority class and not the samples from the minority class. I suppose I can fix this by concatenating dfWithDroppedRows with df[df['y'] == 'Yes'] but I feel there should be a more straightforward way of doing this. Any suggestions?
In my opinion, concatenating the two DataFrames is not such a bad idea, but if you do not like it, here's my suggestion.
mask_majority = df.eval("y == 'No'")
mask_missing = df.isna().sum(axis="columns") >= x
import numpy as np
mask_drop = np.logical_and(mask_majority, mask_missing)
mask_keep = np.logical_not(mask_drop)
dfWithDroppedRows = df.loc[mask_keep, :]
Basically, I create a mask for the majority class, and a mask for all rows with more than x
missing values.
Then I combine the two masks to get a mask with all the rows that must not be dropped, and I get the DataFrame containing only those rows using .loc
.
By the way, if you decide to use the initial solution of concatenating the two DataFrame, I would use query
method instead, it is more idiomatic:
df_majority_droppedRows = df.query("y == 'No'").dropna(thresh=x)
df_minority = df.query("y == 'Yes'")
dfWithDroppedRows = pd.concat([df_majority_droppedRows, df_minority])