python-3.x pandas sklearn-pandas hyperopt

KeyError: '[...] not in index' occurs when train/test sets are split manually into two files


I get the error KeyError: '[...] not in index' when running a hyperopt-sklearn (hpsklearn) regression example on my dataset.

I have seen other answers to this problem where the fix was to use iloc, e.g. X_train = X.iloc[train_indices]. In my case, however, I have manually split my dataset into two files, so I don't need to do any slicing or indexing myself. I used a separate script to take a big dataset and split it into a train set file and a test set file. These files do not have index columns and contain only numeric data. In case you are wondering about the dataset, it is from UCI and is called the protein physicochemical dataset.

from hpsklearn import HyperoptEstimator, any_regressor, xgboost_regression
from sklearn.datasets import load_iris
from hyperopt import tpe
import numpy as np
import pandas as pd

# Load the pre-split training and test sets

X_train = pd.read_csv('data2/CASP_train.csv')
X_test = pd.read_csv('data2/CASP_test.csv')

y_train = X_train['Y']
y_test = X_test['Y']

X_train.drop('Y',axis=1,inplace=True)
X_test.drop('Y',axis=1,inplace=True)
print(list(X_test))
#X_train.drop(list(X_train)[0],axis=1,inplace=True)
#X_test.drop(list(X_test)[0],axis=1,inplace=True)
print(list(X_test))
print(X_train)
# Instantiate a HyperoptEstimator with the search space and number of evaluations

estim = HyperoptEstimator(regressor=xgboost_regression('xgreg'),
                          preprocessing=('my_pre'),
                          algo=tpe.suggest,
                          max_evals=100,
                          trial_timeout=120)

estim.fit(X_train, y_train)

print(estim.score(X_test, y_test))
print(estim.best_model())


The full traceback is as follows:

Traceback (most recent call last):
  File "PRSAXGB.py", line 30, in <module>
    estim.fit(X_train, y_train)
  File "/home/rj/anaconda3/lib/python3.6/site-packages/hpsklearn/estimator.py", line 783, in fit
    fit_iter.send(increment)
  File "/home/rj/anaconda3/lib/python3.6/site-packages/hpsklearn/estimator.py", line 693, in fit_iter
    return_argmin=False, # -- in case no success so far
  File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/fmin.py", line 389, in fmin
    show_progressbar=show_progressbar,
  File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/base.py", line 643, in fmin
    show_progressbar=show_progressbar)
  File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/fmin.py", line 408, in fmin
    rval.exhaust()
  File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/fmin.py", line 262, in exhaust
    self.run(self.max_evals - n_done, block_until_done=self.asynchronous)
  File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/fmin.py", line 227, in run
    self.serial_evaluate()
  File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/fmin.py", line 141, in serial_evaluate
    result = self.domain.evaluate(spec, ctrl)
  File "/home/rj/anaconda3/lib/python3.6/site-packages/hyperopt/base.py", line 848, in evaluate
    rval = self.fn(pyll_rval)
  File "/home/rj/anaconda3/lib/python3.6/site-packages/hpsklearn/estimator.py", line 656, in fn_with_timeout
    raise fn_rval[1]
KeyError: '[    0     1     2 ... 29264 29265 29266] not in index'

Solution

  • The solution was to pass NumPy arrays instead of pandas objects: estim.fit(X_train.values, y_train.values)
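
A likely explanation, offered as an assumption rather than something confirmed from the hpsklearn source: the estimator makes its own internal train/validation split and selects rows with an integer array, as in X[indices]. On a DataFrame that expression looks the integers up as column labels, which would produce exactly this kind of KeyError. On a NumPy array it selects rows by position. The sketch below reproduces the difference on a toy frame; the column names F1 and F2 and the row_positions array are hypothetical stand-ins, not taken from the CASP files.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the CASP data: named columns, default RangeIndex.
X = pd.DataFrame({"F1": [0.1, 0.2, 0.3, 0.4], "F2": [1.0, 2.0, 3.0, 4.0]})
y = pd.Series([10.0, 20.0, 30.0, 40.0], name="Y")

# Hypothetical row positions, like those an internal split might choose.
row_positions = np.array([0, 2, 3])

# On a DataFrame, X[int_array] treats the integers as column labels,
# so it raises KeyError because no columns are named 0, 2 or 3.
try:
    X[row_positions]
    dataframe_error = None
except KeyError as exc:
    dataframe_error = exc

# On NumPy arrays the same expression selects rows by position.
X_rows = X.values[row_positions]
y_rows = y.values[row_positions]

print(type(dataframe_error).__name__)
print(X_rows.shape)
```

If the model is fitted on arrays, it is probably safest to score on arrays as well, e.g. estim.score(X_test.values, y_test.values). On recent pandas versions, .to_numpy() is the recommended equivalent of .values.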