I am starting with programming. I would like to treat missing data by replacing it for the mean value of its column instead of its row-mean-value.
I'm trying to pre-process some data, from an Excel datasheet, similar to the one written as follows (the first column, the one with no title, is just the numbering/index; is not a real column):
A B C D E F
0 100 NaN 5.0 1.0 5 a
1 200 5.0 NaN 3.0 3 a
2 300 4.0 NaN 5.0 6 a
3 400 5.0 4.0 7.0 9 b
4 500 5.0 2.0 NaN 2 b
5 600 4.0 3.0 0.0 4 b
The point is that I would like to replace NaN for the mean value of its column. I am using pandas to extract the values from the datasheet. Then I try with sklearn.impute.SimpleImputer to treat these missing data but it just gives me the mean value of the rows.
dataset = pd.read_excel(io=file_name, sheet_name=sheet)
y = dataset.iloc[:, 0].values
X = dataset.iloc[:, 1:-1].values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, :])
X[:, :] = imputer.transform(X[:, :])
Does anyone know how could I do that in an efficient way? Is there any method of a class that treats this case? (I do not mind getting into another library).
Thank you very much
If you want to fill NaN
values with the mean of the column, you can simply use:
data = {'A':[100,200,300,400,500,600],'B':[np.nan,5,4,5,5,4],'C':[5,np.nan,np.nan,4,2,3],'D':[1,3,5,7,np.nan,0]}
df = pd.DataFrame(data)
df = df.fillna(df.mean())
print(df)
Output:
A B C D
0 100 4.6 5.0 1.0
1 200 5.0 3.5 3.0
2 300 4.0 3.5 5.0
3 400 5.0 4.0 7.0
4 500 5.0 2.0 3.2
5 600 4.0 3.0 0.0