Search code examples
apache-sparkmachine-learningemr

Spark Decision tree fit runs in 1 task


I am trying to "train" a DecisionTreeClassifier using Apache Spark running in a cluster in Amazon EMR. Even though I can see that there are around 50 Executors added and that the features are created by querying a Postgres database using SparkSQL and stored in a DataFrame. The DesisionTree fit method takes for many hours even though the Dataset is not that big (10.000 db entries with a couple of hundreds of bytes each row).

I can see that there is only one task for this so I assume this is the reason that it's been so slow.

Where should I look for the reason that this is running in one task? Is it the way that I retrieve the data? I am sorry if this is a bit vague but I don't know if the code that retrieves the data is relevant, or is it a parameter in the algorithm (although I didn't find anything online), or is it just Spark tuning? I would appreciate any direction!

Thanks in advance.


Solution

  • Spark relies on data locality. It seems that all the data is located in a single place. Hence spark uses a single partition to process it. You could apply a repartition or state the number of partitions you would like to use at load time. I would also look into the decision tree Api and see if you can set the number of partitions for it specifically. Basically, partitions are your level of parallelism.