Tags: python-3.x, pandas, data-science, sklearn-pandas

What does X in imputer = imputer.fit(X[:,1:3]) stand for, and what does imputer.fit(X[:,1:3]) mean?


I'm working on preprocessing a data set, and I get an error from the line imputer = imputer.fit(X[:,1:3]), which I don't understand. I understand that imputer = Imputer(missing_values = "NaN", strategy = "mean") means replacing missing values with the mean value, both in columns and rows. Are we then trying to fit the data to the model? That is the part I don't understand.


import pandas as pd 
from sklearn import svm
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import Imputer
import seaborn as sns; sns.set(font_scale=1.2)

stock=pd.read_csv("C:/Users/Dulangi/Downloads/winequality-red.csv")
stock.head()

g=sns.lmplot('alcohol','quality',data=stock,height=7, truncate=True, scatter_kws={"s":100})
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)

imputer = imputer.fit(X[:,1:3])

The error I get:


NameError                                 Traceback (most recent call last)
<ipython-input-4-620c08822929> in <module>
     14 imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
     15 
---> 16 imputer = imputer.fit(X[:,1:3])

NameError: name 'X' is not defined

Solution

  • Imputer comes from the scikit-learn library and is used to fill in missing values. Each missing value is replaced with the mean, median, or most frequent value (mode) of its column in the data set, depending on the strategy chosen.

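    X is the array of feature values that the imputer works on. The NameError occurs because the code never assigns X before slicing it; it would need to be created first, for example from the values of the stock DataFrame.

    In X[:,1:3], the part before the comma selects rows. A bare : selects all rows; a range such as 1:10 would instead select the rows at index 1 through 9, because indexing starts at 0 and the stop index is excluded.

    The part after the comma selects columns. Here 1:3 selects the columns at index 1 and 2, that is, the second and third columns. A bare : would select all columns.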

    fit then computes and stores the fill value for each selected column (the mean, or whatever strategy was assigned) from the training data. transform uses those stored values to fill in the missing entries, and the same stored values are applied to the test data.

    Refer to these for more detail:

    https://www.youtube.com/watch?v=fCMrO_VzeL8&t=515s

    https://www.youtube.com/watch?v=oH3wYKvwpJ8&t=1s

    https://medium.com/@kanchanardj/jargon-in-python-used-in-data-science-to-laymans-language-part-two-98787cce0928
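
The row and column selection in [:,1:3] can be checked on a small array. This is only a sketch: the 4x4 array below is a made-up stand-in for the real feature matrix, not data from the question.

```python
import numpy as np

# Hypothetical 4x4 array standing in for the feature matrix X.
X = np.arange(16).reshape(4, 4)

# ':' keeps every row; '1:3' keeps the columns at index 1 and 2
# (the stop index 3 is excluded).
subset = X[:, 1:3]
print(subset.shape)  # (4, 2)

# A row range follows the same rule: 1:3 keeps the rows at index 1 and 2.
rows = X[1:3, :]
print(rows.shape)  # (2, 4)
```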
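
To illustrate how defining X removes the NameError and what fit and transform do, here is a minimal sketch. It makes several assumptions: the small DataFrame and its column names are invented in place of the CSV, and it uses SimpleImputer from sklearn.impute, because the Imputer class was removed in scikit-learn 0.22. SimpleImputer has no axis argument and always works column by column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data standing in for the CSV; np.nan marks missing values.
stock = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, np.nan, 30.0, 50.0],
    "c": [np.nan, 2.0, 4.0, 6.0],
})

# X has to exist before it can be sliced; here it is a writable
# NumPy copy of the DataFrame's values.
X = stock.to_numpy(dtype=float, copy=True)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit computes and stores one mean per selected column (index 1 and 2).
imputer.fit(X[:, 1:3])
print(imputer.statistics_)  # stored column means: 30.0 and 4.0

# transform replaces each NaN with the stored mean of its column.
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)
```

With the real data, X would be built from the DataFrame read by pd.read_csv instead, and the column range would be chosen to match the columns that actually contain missing values.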