python-3.x sklearn-pandas

How to replace the missing values of train and test with the mean of the data


I have preprocessed the dataset, converted the categorical values to dummies and certain columns to float, and performed train_test_split. Now I want to replace the missing values with the mean of the column, but separately for train and test.

Side note:

But doesn't that add to the test data? We have to keep the test data separate, right? The instructor told me I have to impute the train and test data separately. But when I impute the test data using the training data, I am just replacing the missing values of test with the mean of train. Doesn't that mean I am tainting my test data, which is not good practice, since we should treat the test data as absolute future values? It didn't make sense to me why we should impute the mean of the train data into the test data. Doesn't that amount to adding the train data to the test data?

If we can't use the test data, why did we replace the missing values of the test data with the mean of the training dataset?

I also want to know the syntax for replacing the missing values of train and test, because I am getting an error with this code:

for col in ld_train.columns():
   if ld_train[col].isnull().sum()>0:
       ld_train.loc[ld_train[col].isnull(),col] = ld_train[col].mean()

for col in ld_test.columns():
   if ld_test[col].isnull().sum()>0:
       ld_test.loc[ld_test[col].isnull(),col] = ld_train[col].mean()

Error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-82-b844a1a6af73> in <module>
      1 for col in x_train:
----> 2     x_train[col] = x_train[col].fillna(x_train[col].mean())
      3 
      4 for col in x_test.columns.value:
      5      x_test[col] = x_train[col].fillna(x_train[col].mean())

IndexError: arrays used as indices must be of integer (or boolean) type

Solution

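  • The error comes from calling ld_train.columns(): columns is an attribute, not a method, so it is not callable. Iterate over ld_train.columns (without parentheses) in the for loop and it should work.
    pandas can also do the same thing without a loop. Note that fillna returns a new DataFrame, so assign the result back, and fill the test set with the training means:

    ld_train = ld_train.fillna(ld_train.mean())
    ld_test = ld_test.fillna(ld_train.mean())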
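
On the side note: filling the test set with the training means does not copy any training rows into the test set. The mean is a statistic learned from the training data, like a model coefficient, and it is then applied to data treated as unseen. Computing the mean from the test set itself would be the leaky option, since it uses information that would not be available at prediction time. Here is a minimal sketch of the idea on toy data. The two small frames and the name train_means are made up for illustration; only ld_train and ld_test come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the question's ld_train / ld_test.
ld_train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})
ld_test = pd.DataFrame({"a": [np.nan, 5.0], "b": [np.nan, 40.0]})

# Column means are computed from the training rows only (NaNs are skipped).
train_means = ld_train.mean()

# fillna returns a new DataFrame, so the result is assigned back.
# Both sets are filled with the *training* means.
ld_train = ld_train.fillna(train_means)
ld_test = ld_test.fillna(train_means)

print(ld_train)
print(ld_test)
```

If you are already using scikit-learn, SimpleImputer(strategy="mean") from sklearn.impute follows the same pattern: fit it on the training data, then call transform on both train and test.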