Tags: python, pandas, numpy, random, list-comprehension

Loop function for missing data


I want to replace NaN values with values drawn from np.random.normal(mu, sigma, n), ideally using a list comprehension, but I couldn't get it to work.

df_column_values = ["NaN","1","NaN","2","NaN","3","94","4","168","5","NaN"]

n, mu, sigma = 700, 155, 118
array = np.random.normal(mu, sigma, n)
for i in array:
    if i > 0 and i < 400:    
        data['Insulin'].replace(0,(i), inplace=True)  

This code runs, but every missing value ends up replaced by the same number. How can I improve it?

Raw data from Kaggle


Solution

  • It looks like you want to replace missing values with normally distributed random values restricted to the range (0, 400). For that you need a truncated normal distribution.

    Then draw a vector of random values with the same length as the column, and take values from it only at the positions you are replacing.

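    import numpy as np
    import pandas as pd
    import scipy.stats as stats

    data = pd.DataFrame({'Insulin': ["NaN","1","NaN","2","NaN","3",
                                     "94","4","168","5","NaN"]})

    # Truncated normal on [lower, upper]; truncnorm takes its bounds
    # in standard-deviation units relative to loc.
    lower, upper = 0, 400
    mu, sigma = 155, 118
    X = stats.truncnorm(
        (lower - mu) / sigma,
        (upper - mu) / sigma,
        loc=mu, scale=sigma)

    # Replace the literal string "NaN" with a separate draw for each row.
    data['Insulin'] = np.where(
        data['Insulin'] == "NaN",
        X.rvs(len(data)),
        data['Insulin'])

    # Also replace real missing values (np.nan), if the column has any.
    data['Insulin'] = np.where(
        data['Insulin'].isna(),
        X.rvs(len(data)),
        data['Insulin'])

    print(data)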
    
           Insulin
    0    59.069239
    1            1
    2   113.143013
    3            2
    4    63.488282
    5            3
    6           94
    7            4
    8          168
    9            5
    10  109.272469
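
As the printed result shows, the approach above leaves the column holding a mix of strings and floats, and it draws len(data) random values even though only a few are used. A possible variation, sketched here under assumptions, converts the column to numbers first and draws exactly one value per missing entry. The helper name `fill_missing_truncnorm` and its `seed` parameter are hypothetical and only there for illustration; the distribution parameters are the ones from the question.

```python
import pandas as pd
from scipy import stats


def fill_missing_truncnorm(series, mu, sigma, lower, upper, seed=None):
    """Return a float copy of `series` whose missing entries are replaced by
    draws from a normal(mu, sigma) distribution truncated to [lower, upper].

    Hypothetical helper, not part of pandas or SciPy.
    """
    # Turn strings such as "94" into floats; "NaN" and anything that
    # cannot be parsed become a real NaN.
    values = pd.to_numeric(series, errors="coerce")
    missing = values.isna()

    # truncnorm expects the bounds in standard-deviation units around loc.
    a, b = (lower - mu) / sigma, (upper - mu) / sigma
    draws = stats.truncnorm.rvs(
        a, b, loc=mu, scale=sigma,
        size=int(missing.sum()), random_state=seed)

    # One independent draw per missing position.
    filled = values.copy()
    filled.loc[missing] = draws
    return filled


data = pd.DataFrame({"Insulin": ["NaN", "1", "NaN", "2", "NaN", "3",
                                 "94", "4", "168", "5", "NaN"]})
data["Insulin"] = fill_missing_truncnorm(
    data["Insulin"], mu=155, sigma=118, lower=0, upper=400, seed=0)
print(data)
```

Passing a fixed `seed` makes the filled values reproducible; leaving it as None gives different values on every run. If missing values are encoded as 0, as in the `replace(0, ...)` call in the question, one option would be to turn those zeros into NaN before calling the helper.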