Cross-validation of a GBTClassifier in PySpark is taking too much time on 2 GB of data (80% train, 20% test). Is there a way to reduce the time? The sample code is given below:
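from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# indexer, assembler and train_df are defined earlier (not shown)
dt = GBTClassifier(maxIter=250)
pipeline_dt = Pipeline(stages=[indexer, assembler, dt])
paramGrid = ParamGridBuilder().build()  # empty grid: a single parameter combination
crossval = CrossValidator(estimator=pipeline_dt, estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(), numFolds=6)
cvModel = crossval.fit(train_df)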
By default, evaluation runs serially: each round starts only after the previous one has finished. Starting with Spark 2.3, CrossValidator has a parallelism
parameter that specifies how many evaluations may run in parallel.
P.S. If you add a parameter search as well, I would recommend looking at the Hyperopt library, which makes hyperparameter search more efficient.