I want to use XGBRegressor to predict sale prices on the Iowa housing data, so I load the training data and the test data:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

iowa_file_path = '../input/train.csv'
test_data_path = '../input/test.csv'
data = pd.read_csv(iowa_file_path)
test_data = pd.read_csv(test_data_path)
(Screenshots showing the contents of data and test_data omitted.)
Then I do some data cleaning, impute missing values, fit the model, and predict on the test data:
data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
train_X, val_X, train_y, val_y = train_test_split(X.values, y.values, test_size=0.25)
my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_X)
val_X = my_imputer.transform(val_X)
my_model = XGBRegressor(n_estimators=100, learning_rate=0.1)
my_model.fit(train_X, train_y, early_stopping_rounds=None,
             eval_set=[(val_X, val_y)], verbose=False)
test_data_process = test_data.select_dtypes(exclude=['object'])
predictions = my_model.predict(test_data_process)
But I get the following error message when running the predict function:
ValueError                                Traceback (most recent call last)
in ()
      1 test_data_process = test_data.select_dtypes(exclude=['object'])
----> 2 predictions = my_model.predict(test_data_process)

/opt/conda/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/sklearn.py in predict(self, data, output_margin, ntree_limit, validate_features)
    395                              output_margin=output_margin,
    396                              ntree_limit=ntree_limit,
--> 397                              validate_features=validate_features)
    398
    399     def apply(self, X, ntree_limit=0):

/opt/conda/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/core.py in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs, pred_interactions, validate_features)
   1206
   1207         if validate_features:
-> 1208             self._validate_features(data)
   1209
   1210         length = c_bst_ulong()

/opt/conda/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/core.py in _validate_features(self, data)
   1508
   1509             raise ValueError(msg.format(self.feature_names,
-> 1510                                         data.feature_names))
   1511
   1512     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36'] ['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
expected f9, f6, f14, f27, f18, f7, f8, f23, f17, f22, f35, f0, f28, f29, f20, f31, f36, f25, f11, f21, f12, f24, f34, f10, f5, f32, f15, f26, f30, f1, f2, f16, f19, f3, f4, f33, f13 in input data
training data did not have the following fields: BsmtUnfSF, 1stFlrSF, LowQualFinSF, MSSubClass, WoodDeckSF, GrLivArea, MiscVal, YearBuilt, BsmtFinSF1, Fireplaces, MoSold, BsmtHalfBath, GarageYrBlt, FullBath, PoolArea, YrSold, HalfBath, 2ndFlrSF, KitchenAbvGr, OverallQual, Id, EnclosedPorch, ScreenPorch, GarageArea, BsmtFullBath, MasVnrArea, TotRmsAbvGrd, OverallCond, BedroomAbvGr, GarageCars, OpenPorchSF, YearRemodAdd, TotalBsmtSF, BsmtFinSF2, LotFrontage, 3SsnPorch, LotArea
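The two sides of that comparison can also be printed directly. This is just a quick check built on the objects above; as far as I know, get_booster() returns the trained Booster from the sklearn wrapper:

# Feature names stored by the trained model.
print(my_model.get_booster().feature_names)

# Column names of the DataFrame handed to predict().
print(list(test_data_process.columns))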
It complains that the feature names mismatch and that the training data did not have those fields. But when I check the contents of data, it does have all of those columns. How can I resolve this?
Just to close the question:
The problem is that SimpleImputer was applied to the training and validation data, but not to the test data. Because the imputer returns plain NumPy arrays, the model was fitted without column names (XGBoost falls back to f0, f1, ...), while predict() received a DataFrame that still carries the original column names, hence the feature_names mismatch.
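For reference, a rough sketch of the fix under that assumption (test_X is just a new name introduced here, and test_data is assumed to contain the same numeric columns that were used to build X):

# Keep the same numeric columns, in the same order, as the training features X.
test_X = test_data[X.columns]

# Reuse the imputer that was fitted on train_X; the output is a NumPy array,
# just like the data the model was trained on, so the feature names line up again.
test_X = my_imputer.transform(test_X.values)

predictions = my_model.predict(test_X)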
A discussion of what can cause this kind of error can be found here: https://github.com/dmlc/xgboost/issues/2334#issuecomment-333195491