I am trying to use GridSearchCV to tune the parameters of a LightGBM model, but I don't know how to save the predicted result from each iteration of the grid search. Sadly, I only know how to save the result for one specific set of parameters.
Here is the code:
param = {
    'bagging_freq': 5,
    'bagging_fraction': 0.4,
    'boost_from_average': 'false',
    'boost': 'gbdt',
    'feature_fraction': 0.05,
    'learning_rate': 0.01,
    'max_depth': -1,
    'metric': 'auc',
    'min_data_in_leaf': 80,
    'min_sum_hessian_in_leaf': 10.0,
    'num_leaves': 13,
    'num_threads': 8,
    'tree_learner': 'serial',
    'objective': 'binary',
    'verbosity': 1
}
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

features = [c for c in train_df.columns if c not in ['ID_code', 'target']]
target = train_df['target']
# shuffle=True so that random_state actually takes effect
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=44000)
oof = np.zeros(len(train_df))
predictions = np.zeros(len(test_df))

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train_df.values, target.values)):
    print("Fold {}".format(fold_))
    trn_data = lgb.Dataset(train_df.iloc[trn_idx][features], label=target.iloc[trn_idx])
    val_data = lgb.Dataset(train_df.iloc[val_idx][features], label=target.iloc[val_idx])
    num_round = 1000000
    clf = lgb.train(param, trn_data, num_round, valid_sets=[trn_data, val_data],
                    verbose_eval=1000, early_stopping_rounds=3000)
    oof[val_idx] = clf.predict(train_df.iloc[val_idx][features], num_iteration=clf.best_iteration)
    predictions += clf.predict(test_df[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(roc_auc_score(target, oof)))
print('Saving the Result File')
res = pd.DataFrame({"ID_code": test_df.ID_code.values})
res["target"] = predictions
res.to_csv('result_10fold{}.csv'.format(num_sub), index=False)
Here is the data:
train_df.head(3)
   ID_code  target    var_0    var_1  ...  var_199
0  train_0       0   8.9255  -6.7863  ...  -9.2834
1  train_1       1  11.5006  -4.1473  ...   7.0433
2  train_2       0   8.6093  -2.7457  ...  -9.0837
test_df.head(3)
   ID_code   var_0    var_1  ...  var_199
0   test_0  9.4292  11.4327  ...  -2.3805
1   test_1  5.0930  11.4607  ...  -9.2834
2   test_2  7.8928  10.5825  ...  -9.0837
I want to save the predictions from each iteration of GridSearchCV. I have searched several similar questions and other relevant information about using GridSearchCV with LightGBM, but I still can't get the code right.
So, if you don't mind, could anyone help me or point me to some tutorials about it?
Thanks sincerely.
You can use ParameterGrid or ParameterSampler from sklearn to do the parameter sampling; they correspond to GridSearchCV and RandomizedSearchCV, respectively. For example,
import copy
from sklearn.model_selection import ParameterGrid

def train_lgb(num_folds=11, param=param_original):
    ...
    return predictions, sub

params = {
    # your base parameters
}

# define the grid for parameter sampling;
# note that a list of dicts varies each parameter separately,
# rather than building a cross product of all of them
par_grid = ParameterGrid([{'bagging_freq': [6, 7]},
                          {'num_leaves': [13, 15]}])

prediction_list = {}
sub_list = {}

for i, ps in enumerate(par_grid):
    print('This is param{}'.format(i))
    # copy the base params dictionary and update with sampled values
    val = copy.deepcopy(params)
    val.update(ps)
    # main training loop
    prediction, sub = train_lgb(param=val)
    # store the result of this iteration under its index
    prediction_list[i] = prediction
    sub_list[i] = sub
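If the full grid becomes too large, ParameterSampler draws a fixed number of random combinations instead of enumerating all of them. A minimal sketch (the parameter values and `n_iter` below are just illustrative choices, not recommendations):

```python
from sklearn.model_selection import ParameterSampler

# candidate values; scipy.stats distributions also work here
param_dist = {'bagging_freq': [5, 6, 7],
              'num_leaves': [13, 15, 31],
              'learning_rate': [0.01, 0.05, 0.1]}

# draw 4 random combinations; random_state makes the draw reproducible
sampler = ParameterSampler(param_dist, n_iter=4, random_state=42)

for i, ps in enumerate(sampler):
    print('Sample {}: {}'.format(i, ps))
```

Each `ps` is a plain dict, so it plugs into the same `val.update(ps)` pattern as the grid version.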
Edit: By the way, I realized that I was investigating the same issue recently and was learning how to address it using some ML tools. I've created a page summarising how to use MLflow for this task: https://mlisovyi.github.io/KaggleSantander2019/ (and the associated GitHub repository for the actual code). Note that, by coincidence, it is based on the same data that you are working on :). I hope it will be useful.
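If a full tracking tool like MLflow feels heavy, a lightweight alternative is to append one row per run (parameters plus score) to a DataFrame and dump it to CSV. The parameter names and scores below are made up for illustration:

```python
import pandas as pd

# one record per parameter combination; values here are illustrative
records = []
for i, (ps, score) in enumerate([({'bagging_freq': 6}, 0.891),
                                 ({'num_leaves': 15}, 0.894)]):
    row = {'run': i, 'cv_auc': score}
    row.update(ps)  # flatten the sampled parameters into the row
    records.append(row)

log_df = pd.DataFrame(records)
log_df.to_csv('param_search_log.csv', index=False)
print(log_df)
```

Parameters that a given run did not sample simply show up as NaN in their column, so the log stays readable even with a heterogeneous grid.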