I want to replace NaN values with draws from np.random.normal(mu, sigma, n),
ideally using a list comprehension, but I couldn't get it to work.
df_column_values = ["NaN","1","NaN","2","NaN","3","94","4","168","5","NaN"]
n, mu, sigma = 700, 155, 118
array = np.random.normal(mu, sigma, n)
for i in array:
    if i > 0 and i < 400:
        data['Insulin'].replace(0, i, inplace=True)
This code runs, but every NaN value ends up replaced by the same number. How can I improve it?
Raw data from Kaggle
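It looks like you want to replace missing values with normally distributed random values restricted to the range (0, 400). Your loop gives the same output everywhere because replace(0, i) substitutes every matching entry at once: the first draw that passes the range check fills all of them, and later iterations find nothing left to replace. To sample within a range, use a truncated normal distribution instead of filtering draws from a plain normal.
Then create a vector of random values of the same length as the column, and take a value from it only at the positions that are missing.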
import numpy as np
import pandas as pd
import scipy.stats as stats

data = pd.DataFrame({'Insulin': ["NaN","1","NaN","2","NaN","3",
                                 "94","4","168","5","NaN"]})

lower, upper = 0, 400
mu, sigma = 155, 118
# truncnorm takes its bounds in standard-deviation units relative to loc
X = stats.truncnorm(
    (lower - mu) / sigma,
    (upper - mu) / sigma,
    loc=mu, scale=sigma)

# If the missing values are stored as the string "NaN":
data['Insulin'] = np.where(
    data['Insulin'] == "NaN",
    X.rvs(len(data)),
    data['Insulin'])

# If the missing values are real NaN:
data['Insulin'] = np.where(
    data['Insulin'].isna(),
    X.rvs(len(data)),
    data['Insulin'])

print(data)
       Insulin
0    59.069239
1            1
2   113.143013
3            2
4    63.488282
5            3
6           94
7            4
8          168
9            5
10  109.272469
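Note that the column above ends up mixing strings and floats. If you would rather have a numeric column, here is a minimal sketch of a variant, assuming it is acceptable to parse the column with pd.to_numeric first. The names dist, insulin and missing and the fixed random_state are my own choices for illustration, not part of your original code. It draws only as many values as there are missing entries and assigns them through a boolean mask.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Same example column as above; "NaN" is stored as text here.
data = pd.DataFrame({"Insulin": ["NaN", "1", "NaN", "2", "NaN", "3",
                                 "94", "4", "168", "5", "NaN"]})

lower, upper = 0, 400
mu, sigma = 155, 118
# Bounds are given in standard-deviation units relative to loc.
dist = stats.truncnorm((lower - mu) / sigma, (upper - mu) / sigma,
                       loc=mu, scale=sigma)

# Parse to floats; the text "NaN" becomes a real missing value.
insulin = pd.to_numeric(data["Insulin"], errors="coerce")
missing = insulin.isna()

# One independent draw per missing entry.
# random_state=0 is an arbitrary seed, used only for reproducibility.
insulin.loc[missing] = dist.rvs(size=int(missing.sum()), random_state=0)
data["Insulin"] = insulin
print(data)
```

If the missing values are encoded as 0 instead, as the replace(0, ...) call in the question suggests, the mask could be built with insulin.eq(0) rather than insulin.isna().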