Search code examples
pythonpandasdataframeimbalanced-data

Remove rows with more than percentage of missing data for majority class samples only


Similar to this post, I am removing rows with >50% missing data to get a more reliable and complete dataset

# Keep only the rows with at least x% non-NA values

# calculate threshold
numOfFeatures=38 # num of features in dataset
x=round(numOfFeatures*0.5) #50%

dfWithDroppedRows = df.dropna(thresh=x)

However, I have a imbalanced dataset, where the majority class makes up almost 93% of my dataset

df['y'].value_counts(normalize=True) * 100
No     92.769441
Yes     7.230559

Therefore, I want to edit the above code to remove rows with >50% missing data from sample from the majority class only, so I do not lose samples from the minority class.

To do this, I have tried:

dfWithDroppedRows = df[df['y'] == 'No'].dropna(thresh=x)

But this results in dfWithDroppedRows containing only reduced rows from majority class and not the samples from the minority class. I suppose I can fix this by concatenating dfWithDroppedRows with df[df['y'] == 'Yes'] but I feel there should be a more straightforward way of doing this. Any suggestions?


Solution

  • In my opinion, concatenating the two DataFrames is not such a bad idea, but if you do not like it, here's my suggestion.

    mask_majority = df.eval("y == 'No'")
    mask_missing = df.isna().sum(axis="columns") >= x
    
    import numpy as np
    
    mask_drop = np.logical_and(mask_majority, mask_missing)
    mask_keep = np.logical_not(mask_drop)
    dfWithDroppedRows = df.loc[mask_keep, :]
    

    Basically, I create a mask for the majority class, and a mask for all rows with more than x missing values. Then I combine the two masks to get a mask with all the rows that must not be dropped, and I get the DataFrame containing only those rows using .loc.

    By the way, if you decide to use the initial solution of concatenating the two DataFrame, I would use query method instead, it is more idiomatic:

    df_majority_droppedRows = df.query("y == 'No'").dropna(thresh=x)
    df_minority = df.query("y == 'Yes'")
    dfWithDroppedRows = pd.concat([df_majority_droppedRows, df_minority])