I am trying to find the best hyperparameters for XGBClassifier, which should in turn identify the most predictive attributes. I am using RandomizedSearchCV to iterate and validate through KFold.
As I run this process a total of 5 times (numFolds = 5), I want the best results to be saved in a DataFrame called collector (specified below). So on each iteration, the best results and score should be appended to the collector DataFrame.
from scipy import stats
from scipy.stats import randint
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn import cross_validation
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, roc_auc_score
clf_xgb = xgb.XGBClassifier(objective = 'binary:logistic')
param_dist = {'n_estimators': stats.randint(150, 1000),
              'learning_rate': stats.uniform(0.01, 0.6),
              'subsample': stats.uniform(0.3, 0.9),
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': stats.uniform(0.5, 0.9),
              'min_child_weight': [1, 2, 3, 4]
             }
clf = RandomizedSearchCV(clf_xgb, param_distributions = param_dist, n_iter = 25, scoring = 'roc_auc', error_score = 0, verbose = 3, n_jobs = -1)
numFolds = 5
folds = cross_validation.KFold(n = len(X), shuffle = True, n_folds = numFolds)
collector = pd.DataFrame()
estimators = []
results = np.zeros(len(X))
score = 0.0
for train_index, test_index in folds:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train, y_train)

    estimators.append(clf.best_estimator_)
    estcoll = pd.DataFrame(estimators)

    estcoll['score'] = score
    pd.concat([collector, estcoll])
    print "\n", len(collector), "\n"
score /= numFolds
For some reason, nothing is being saved to the DataFrame. Please help.
Also, I have about 350 attributes to cycle through, with 3.5K rows in training and 2K in testing. Would running this through a Bayesian hyperparameter optimization process potentially improve my results, or would it only save on processing time?
RandomizedSearchCV() will do more for you than you realize. Explore the cv_results_ attribute of your fitted CV object at the documentation page.
Here's your code pretty much unchanged. The two changes I made:
1. Changed n_iter=5 from 25. This will sample 5 sets of parameters, which with your 5-fold cross-validation means 25 total fits.
2. Defined your kfold object before RandomizedSearchCV, and then referenced it in the construction of RandomizedSearchCV as the cv param.
clf_xgb = xgb.XGBClassifier(objective = 'binary:logistic')
param_dist = {'n_estimators': stats.randint(150, 1000),
              'learning_rate': stats.uniform(0.01, 0.59),
              'subsample': stats.uniform(0.3, 0.6),
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': stats.uniform(0.5, 0.4),
              'min_child_weight': [1, 2, 3, 4]
             }
numFolds = 5
kfold_5 = cross_validation.KFold(n = len(X), shuffle = True, n_folds = numFolds)
clf = RandomizedSearchCV(clf_xgb,
                         param_distributions = param_dist,
                         cv = kfold_5,
                         n_iter = 5, # you want 5 here not 25 if I understand you correctly
                         scoring = 'roc_auc',
                         error_score = 0,
                         verbose = 3,
                         n_jobs = -1)
Here's where my answer deviates from your code significantly. Just fit the RandomizedSearchCV object once; there is no need to loop. It handles the CV looping with its cv argument.
clf.fit(X_train, y_train)
All your cross-validated results are now in clf.cv_results_. For example, you can get the cross-validated (mean across the 5 folds) train score with clf.cv_results_['mean_train_score'], or the cross-validated test-set (held-out data) score with clf.cv_results_['mean_test_score']. You can also get other useful things like mean_fit_time and params. And clf, once fitted, will automatically remember your best_estimator_ as an attribute.
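If you want those results collected in a DataFrame like your collector, a minimal sketch (assuming clf has been fitted as above; results_df and collector are just illustrative names) could be:

results_df = pd.DataFrame(clf.cv_results_)  # one row per sampled parameter set
collector = results_df[['params', 'mean_test_score', 'std_test_score', 'mean_fit_time']]
collector = collector.sort_values('mean_test_score', ascending = False)
print(collector.head())

Each row corresponds to one of the n_iter parameter draws, already averaged over the 5 folds.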
These cross-validated summaries are what's relevant for determining the best set of hyperparameters for model fitting. A single set of hyperparameters is constant across each of the 5 folds used within a single iteration from n_iter, so you don't have to peer at the different scores between folds within an iteration.
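Once the search is fitted, something along these lines (a sketch; X_test and y_test are assumed to be your held-out 2K rows, and roc_auc_score is the sklearn metric you already import) pulls out the winning configuration and checks it on the held-out data:

print(clf.best_params_)  # the sampled hyperparameters with the highest mean CV roc_auc
print(clf.best_score_)   # that best mean cross-validated roc_auc

# best_estimator_ is refit on all the data passed to fit() (refit=True by default),
# so it can be evaluated directly on the held-out rows
y_proba = clf.best_estimator_.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_proba))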