python pandas missing-data sklearn-pandas imputation

Method for missing data in rows?

I am starting with programming. I would like to treat missing data by replacing it for the mean value of its column instead of its row-mean-value.

I'm trying to pre-process some data, from an Excel datasheet, similar to the one written as follows (the first column, the one with no title, is just the numbering/index; is not a real column):

    A    B    C    D   E  F
0  100  NaN  5.0  1.0  5  a
1  200  5.0  NaN  3.0  3  a
2  300  4.0  NaN  5.0  6  a
3  400  5.0  4.0  7.0  9  b
4  500  5.0  2.0  NaN  2  b
5  600  4.0  3.0  0.0  4  b

The point is that I would like to replace NaN for the mean value of its column. I am using pandas to extract the values from the datasheet. Then I try with sklearn.impute.SimpleImputer to treat these missing data but it just gives me the mean value of the rows.

dataset = pd.read_excel(io=file_name, sheet_name=sheet)
y = dataset.iloc[:, 0].values
X = dataset.iloc[:, 1:-1].values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, :])
X[:, :] = imputer.transform(X[:, :])

Does anyone know how could I do that in an efficient way? Is there any method of a class that treats this case? (I do not mind getting into another library).

Thank you very much

Solution

If you want to fill NaN values with the mean of the column, you can simply use:

data = {'A':[100,200,300,400,500,600],'B':[np.nan,5,4,5,5,4],'C':[5,np.nan,np.nan,4,2,3],'D':[1,3,5,7,np.nan,0]}
df = pd.DataFrame(data)
df = df.fillna(df.mean())
print(df)

Output:

     A    B    C    D
0  100  4.6  5.0  1.0
1  200  5.0  3.5  3.0
2  300  4.0  3.5  5.0
3  400  5.0  4.0  7.0
4  500  5.0  2.0  3.2
5  600  4.0  3.0  0.0