I want to use XGBRegressor to predict sale prices on the Iowa housing data, so I load the training data and the test data:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

iowa_file_path = '../input/train.csv'
test_data_path = '../input/test.csv'
data = pd.read_csv(iowa_file_path)
test_data = pd.read_csv(test_data_path)
(Screenshots showing the contents of data and test_data omitted.)
Then I do some data cleaning, impute missing values, fit the model, and predict on the test data:
data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
train_X, val_X, train_y, val_y = train_test_split(X.values, y.values, test_size=0.25)
my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_X)
val_X = my_imputer.transform(val_X)
my_model = XGBRegressor(n_estimators=100, learning_rate=0.1)
my_model.fit(train_X, train_y, early_stopping_rounds=None,
             eval_set=[(val_X, val_y)], verbose=False)
test_data_process = test_data.select_dtypes(exclude=['object'])
predictions = my_model.predict(test_data_process)
But I get the following error message when running the predict function:
ValueError                                Traceback (most recent call last)
in ()
      1 test_data_process = test_data.select_dtypes(exclude=['object'])
----> 2 predictions = my_model.predict(test_data_process)

/opt/conda/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/sklearn.py in predict(self, data, output_margin, ntree_limit, validate_features)
    395                              output_margin=output_margin,
    396                              ntree_limit=ntree_limit,
--> 397                              validate_features=validate_features)
    398
    399     def apply(self, X, ntree_limit=0):

/opt/conda/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/core.py in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs, pred_interactions, validate_features)
   1206
   1207         if validate_features:
-> 1208             self._validate_features(data)
   1209
   1210         length = c_bst_ulong()

/opt/conda/lib/python3.6/site-packages/xgboost-0.80-py3.6.egg/xgboost/core.py in _validate_features(self, data)
   1508
   1509             raise ValueError(msg.format(self.feature_names,
-> 1510                                         data.feature_names))
   1511
   1512     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36'] ['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
expected f9, f6, f14, f27, f18, f7, f8, f23, f17, f22, f35, f0, f28, f29, f20, f31, f36, f25, f11, f21, f12, f24, f34, f10, f5, f32, f15, f26, f30, f1, f2, f16, f19, f3, f4, f33, f13 in input data
training data did not have the following fields: BsmtUnfSF, 1stFlrSF, LowQualFinSF, MSSubClass, WoodDeckSF, GrLivArea, MiscVal, YearBuilt, BsmtFinSF1, Fireplaces, MoSold, BsmtHalfBath, GarageYrBlt, FullBath, PoolArea, YrSold, HalfBath, 2ndFlrSF, KitchenAbvGr, OverallQual, Id, EnclosedPorch, ScreenPorch, GarageArea, BsmtFullBath, MasVnrArea, TotRmsAbvGrd, OverallCond, BedroomAbvGr, GarageCars, OpenPorchSF, YearRemodAdd, TotalBsmtSF, BsmtFinSF2, LotFrontage, 3SsnPorch, LotArea
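The two sides of that comparison can also be printed directly. This is just a quick check built on the objects above; as far as I know, get_booster() returns the trained Booster from the sklearn wrapper:

# Feature names stored by the trained model.
print(my_model.get_booster().feature_names)

# Column names of the DataFrame handed to predict().
print(list(test_data_process.columns))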
It complains that the feature names mismatch and that the training data did not have those fields. But when I check the contents of data, it does have all of those columns. How can I resolve this?
Just to close the question:
The problem is that SimpleImputer was applied to the training and validation data, but not to the test data. Because the imputer returns plain NumPy arrays, the model was fitted without column names (XGBoost falls back to f0, f1, ...), while predict() received a DataFrame that still carries the original column names, hence the feature_names mismatch.
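For reference, a rough sketch of the fix under that assumption (test_X is just a new name introduced here, and test_data is assumed to contain the same numeric columns that were used to build X):

# Keep the same numeric columns, in the same order, as the training features X.
test_X = test_data[X.columns]

# Reuse the imputer that was fitted on train_X; the output is a NumPy array,
# just like the data the model was trained on, so the feature names line up again.
test_X = my_imputer.transform(test_X.values)

predictions = my_model.predict(test_X)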
A discussion of what can cause this kind of error can be found here: https://github.com/dmlc/xgboost/issues/2334#issuecomment-333195491