Search code examples
pandasdata-cleaning

Fill missing Values by a ratio of other values in Pandas


I have a column in a Dataframe in Pandas with around 78% missing values.

The remaining 22% values are divided between three labels - SC, ST, GEN with the following ratios.

SC - 16% ST - 8% GEN - 76%

I need to replace the missing values by the above three values so that the ratio of all the elements remains same as above. The assignment can be random as long the the ratio remains as above.

How do I accomplish this?


Solution

  • Starting with this DataFrame (only to create something similar to yours):

    import numpy as np
    df = pd.DataFrame({'C1': np.random.choice(['SC', 'ST', 'GEN'], p=[0.16, 0.08, 0.76], 
                                              size=1000)})
    df.loc[df.sample(frac=0.22).index] = np.nan
    

    It yields a column with 22% NaN and the remaining proportions are similar to yours:

    df['C1'].value_counts(normalize=True, dropna=False)
    Out: 
    GEN    0.583
    NaN    0.220
    SC     0.132
    ST     0.065
    Name: C1, dtype: float64
    
    df['C1'].value_counts(normalize=True)
    Out: 
    GEN    0.747436
    SC     0.169231
    ST     0.083333
    Name: C1, dtype: float64
    

    Now you can use fillna with np.random.choice:

    df['C1'] = df['C1'].fillna(pd.Series(np.random.choice(['SC', 'ST', 'GEN'], 
                                                          p=[0.16, 0.08, 0.76], size=len(df))))
    

    The resulting column will have these proportions:

    df['C1'].value_counts(normalize=True, dropna=False)
    Out: 
    GEN    0.748
    SC     0.165
    ST     0.087
    Name: C1, dtype: float64