python, pandas, imputation

How can I impute the NA values in a DataFrame with values drawn at random from a specified normal distribution?


I want to impute the NA values in a DataFrame with values drawn at random from a specified normal distribution. The DataFrame df is defined as follows:

    A   B   C   D
1   3   NA  4   NA
2   3.4 2.3 4.1 NA
3   2.3 0.1 0.2 6.3
4   3.1 4.5 2.1 0.2
5   4.1 2.5 NA  2.4

I want to fill each NA with a value drawn at random from a generated normal distribution, so that the filled-in values differ from one another. The mean of the normal distribution is the 1% quantile of all the values in the DataFrame. The standard deviation is the median of the row-wise standard deviations of the DataFrame.

My code is as follows:

import pandas as pd
import numpy as np

df = pd.read_csv('try.txt',sep="\t")
df.index = df['type']
del df['type']
sigma = median(df.std(axis=1))
mu = df.quantile(0.01)
# mean and standard deviation
df = df.fillna(np.random.normal(mu, sigma, 1))

The mean is incorrect, and the DataFrame cannot be filled with the simulated array. How can I make this work? Thank you.


Solution

  • There are a few problems with your code

    df.index = df['type']
    del df['type']
    

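    can be expressed more concisely as df = df.set_index('type'). Note that set_index returns a new DataFrame rather than modifying df in place, so the result has to be assigned back.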

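    median(df.std(axis=1)) should be df.std(axis=1).median(): median is not a built-in function, but it is a method of the Series that df.std(axis=1) returns.

    df.quantile() returns a Series with one quantile per column. If you want the quantile of all the values, stack the DataFrame into a single Series first: df.stack().quantile(0.01)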

    sigma = df.std(axis=1).median()
    mu = df.stack().quantile(0.01)
    print((sigma, mu))
    
     (0.9539392014169454, 0.115)
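
To make the numbers above reproducible without the try.txt file, here is a minimal sketch that rebuilds the question's table by hand. The hard-coded DataFrame and the name per_column are assumptions for illustration only; the computation of mu and sigma is the same as in the snippet above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; NaN marks the missing cells.
df = pd.DataFrame(
    {
        "A": [3.0, 3.4, 2.3, 3.1, 4.1],
        "B": [np.nan, 2.3, 0.1, 4.5, 2.5],
        "C": [4.0, 4.1, 0.2, 2.1, np.nan],
        "D": [np.nan, np.nan, 6.3, 0.2, 2.4],
    },
    index=[1, 2, 3, 4, 5],
)

# Quantile per column: a Series with one entry for each of A, B, C, D.
per_column = df.quantile(0.01)

# Quantile over all non-missing values: a single number.
mu = df.stack().quantile(0.01)

# Median of the row-wise sample standard deviations (NaN is skipped).
sigma = df.std(axis=1).median()

print(per_column)
print((sigma, mu))
```

The first print shows why the original mu was not usable as a single mean: it holds four values, one per column, whereas the stacked version collapses everything into one scalar.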
    

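    First you have to find the empty fields. The easiest way is with .stack and pd.isnull. (In recent pandas versions the dropna argument of stack is deprecated in favour of a new implementation that keeps missing values by default, so check the documentation for your version.)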

    df2 = df.stack(dropna=False)
    s = df2[pd.isnull(df2)]
    

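    Now you can impute the random values in two ways. The first is to assign them into the stacked Series and unstack the result:

    # Draw one value per missing cell, so the filled-in values all differ
    ran = np.random.normal(mu, sigma, len(s))
    df3 = df.stack(dropna=False)
    df3.loc[s.index] = ran
    df3.unstack()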
    
       A    B                    C                    D
    1  3.0  0.38531116198179066  4.0                  0.7070154252582993
    2  3.4  2.3                  4.1                  -0.8651789931843614
    3  2.3  0.1                  0.2                  6.3
    4  3.1  4.5                  2.1                  0.2
    5  4.1  2.5                  -1.3176599584973157  2.4
    

    Or via a loop, overwriting the empty fields in the original DataFrame

    for (row, column), value in zip(s.index.tolist(), np.random.normal(mu, sigma, len(s))):
        df.loc[row, column] = value
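
As a possible third variant, not taken from the answer above, the same imputation can be sketched with a boolean mask on the underlying NumPy array, which avoids stack(dropna=False) altogether. The hard-coded sample data, the names rng, mask, values and filled, and the seed are all assumptions chosen for illustration; the seed only makes the random draws repeatable.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; NaN marks the missing cells.
df = pd.DataFrame(
    {
        "A": [3.0, 3.4, 2.3, 3.1, 4.1],
        "B": [np.nan, 2.3, 0.1, 4.5, 2.5],
        "C": [4.0, 4.1, 0.2, 2.1, np.nan],
        "D": [np.nan, np.nan, 6.3, 0.2, 2.4],
    },
    index=[1, 2, 3, 4, 5],
)

mu = df.stack().quantile(0.01)      # 1% quantile of all non-missing values
sigma = df.std(axis=1).median()     # median of the row-wise standard deviations

# Seeded generator (arbitrary seed) so that repeated runs give the same fill values.
rng = np.random.default_rng(seed=0)

# Boolean array that is True exactly where df has a missing value.
mask = df.isna().to_numpy()

# Work on a float copy so the original DataFrame is left untouched.
values = df.to_numpy(dtype=float, copy=True)

# One independent draw per missing cell.
values[mask] = rng.normal(mu, sigma, size=mask.sum())

filled = pd.DataFrame(values, index=df.index, columns=df.columns)
print(filled)
```

Because the draws come from a normal distribution centred near the low end of the data, some imputed values can be negative, as in the output shown earlier. If that is not acceptable for your data, the draws would need to be clipped or resampled.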