python, pyspark, cross-validation

Cross-validation of GBTClassifier in PySpark takes too long on 2 GB of data (80% train, 20% test). Is there a way to reduce the time?


Cross-validating a GBTClassifier in PySpark takes too long on 2 GB of data (80% train, 20% test). Is there a way to reduce the time? The sample code is given below:

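from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# indexer, assembler, and train_df are defined elsewhere
dt = GBTClassifier(maxIter=250)
pipeline_dt = Pipeline(stages=[indexer, assembler, dt])
paramGrid = ParamGridBuilder().build()  # empty grid: one parameter combination
crossval = CrossValidator(estimator=pipeline_dt, estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(), numFolds=6)
cvModel = crossval.fit(train_df)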

Solution

  • By default, evaluation runs serially: each round starts only after the previous one has finished. Starting with Spark 2.3, CrossValidator has a parallelism parameter that specifies how many evaluations may run in parallel.

    P.S. If you also add a parameter search, I would recommend looking at the Hyperopt library, which improves hyperparameter search.