I am trying to find the best hyperparameters for XGBClassifier, which should in turn identify the most predictive attributes. I am using RandomizedSearchCV to iterate and validate through KFold.
As I run this process a total of 5 times (numFolds = 5), I want the best results to be saved in a DataFrame called collector (specified below). So on each iteration, the best results and score should be appended to the collector DataFrame.
from scipy import stats
from scipy.stats import randint
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn import cross_validation
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
clf_xgb = xgb.XGBClassifier(objective = 'binary:logistic')
param_dist = {'n_estimators': stats.randint(150, 1000),
              'learning_rate': stats.uniform(0.01, 0.6),
              'subsample': stats.uniform(0.3, 0.9),
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': stats.uniform(0.5, 0.9),
              'min_child_weight': [1, 2, 3, 4]
             }
clf = RandomizedSearchCV(clf_xgb, param_distributions = param_dist, n_iter = 25, scoring = 'roc_auc', error_score = 0, verbose = 3, n_jobs = -1)
numFolds = 5
folds = cross_validation.KFold(n = len(X), shuffle = True, n_folds = numFolds)
collector = pd.DataFrame()
estimators = []
results = np.zeros(len(X))
score = 0.0
for train_index, test_index in folds:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train, y_train)

    estimators.append(clf.best_estimator_)
    estcoll = pd.DataFrame(estimators)

    estcoll['score'] = score
    pd.concat([collector, estcoll])
    print "\n", len(collector), "\n"
score /= numFolds
For some reason, nothing is being saved to the DataFrame. Please help.
Also, I have about 350 attributes to cycle through, with 3.5K rows in training and 2K in testing. Would running this through a Bayesian hyperparameter optimization process potentially improve my results, or would it only save on processing time?
RandomizedSearchCV() will do more for you than you realize. Explore the cv_results_ attribute of your fitted CV object at the documentation page.
Here's your code pretty much unchanged. The two changes I made:
1. Changed n_iter=5 from 25. This will sample 5 sets of parameters, which with your 5-fold cross-validation means 25 total fits.
2. Defined your kfold object before RandomizedSearchCV, and then referenced it in the construction of RandomizedSearchCV as the cv param.
clf_xgb = xgb.XGBClassifier(objective = 'binary:logistic')
param_dist = {'n_estimators': stats.randint(150, 1000),
              'learning_rate': stats.uniform(0.01, 0.59),
              'subsample': stats.uniform(0.3, 0.6),
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': stats.uniform(0.5, 0.4),
              'min_child_weight': [1, 2, 3, 4]
             }
numFolds = 5
kfold_5 = cross_validation.KFold(n = len(X), shuffle = True, n_folds = numFolds)
clf = RandomizedSearchCV(clf_xgb,
                         param_distributions = param_dist,
                         cv = kfold_5,
                         n_iter = 5, # you want 5 here not 25 if I understand you correctly
                         scoring = 'roc_auc',
                         error_score = 0,
                         verbose = 3,
                         n_jobs = -1)
Here's where my answer deviates from your code significantly. Just fit the RandomizedSearchCV object once; there is no need to loop. It handles the CV looping with its cv argument.
clf.fit(X_train, y_train)
All your cross-validated results are now in clf.cv_results_. For example, you can get the cross-validated (mean across the 5 folds) train score with clf.cv_results_['mean_train_score'], or the cross-validated test-set (held-out data) score with clf.cv_results_['mean_test_score']. You can also get other useful things like mean_fit_time and params. And clf, once fitted, will automatically remember your best_estimator_ as an attribute.
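If you want those results collected in a DataFrame like your collector, a minimal sketch (assuming clf has been fitted as above; results_df and collector are just illustrative names) could be:

results_df = pd.DataFrame(clf.cv_results_)  # one row per sampled parameter set
collector = results_df[['params', 'mean_test_score', 'std_test_score', 'mean_fit_time']]
collector = collector.sort_values('mean_test_score', ascending = False)
print(collector.head())

Each row corresponds to one of the n_iter parameter draws, already averaged over the 5 folds.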
These cross-validated summaries are what's relevant for determining the best set of hyperparameters for model fitting. A single set of hyperparameters is constant across each of the 5 folds used within a single iteration from n_iter, so you don't have to peer at the different scores between folds within an iteration.
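Once the search is fitted, something along these lines (a sketch; X_test and y_test are assumed to be your held-out 2K rows, and roc_auc_score is the sklearn metric you already import) pulls out the winning configuration and checks it on the held-out data:

print(clf.best_params_)  # the sampled hyperparameters with the highest mean CV roc_auc
print(clf.best_score_)   # that best mean cross-validated roc_auc

# best_estimator_ is refit on all the data passed to fit() (refit=True by default),
# so it can be evaluated directly on the held-out rows
y_proba = clf.best_estimator_.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_proba))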