We have a Spark environment that needs to process roughly 50 million rows. These rows contain a key column, and the number of unique keys is close to 2000. I would like to process all 2000 keys in parallel, so we are using a Spark SQL query like the following:
hiveContext.sql("select * from BigTbl DISTRIBUTE by KEY")
Subsequently we have a mapPartitions step that works nicely on all the partitions in parallel. The trouble is that the query creates only 200 partitions by default. Using a command like the following, I am able to increase the number of partitions:
hiveContext.sql("set spark.sql.shuffle.partitions=500");
However, during a real production run I would not know the number of unique keys in advance. I want this to be managed automatically. Is there a way to do this, please?
Thanks
Bala
I would suggest that you use the repartition function, register the repartitioned result as a new temp table, and also cache it for faster processing.
val distinctValues = hiveContext.sql("select KEY from BigTbl").distinct().count() // count the distinct KEY values
hiveContext.sql("select * from BigTbl DISTRIBUTE by KEY")
  .repartition(distinctValues.toInt)  // repartition to the number of distinct keys
  .registerTempTable("NewBigTbl")     // register the repartitioned result as a new temp table
hiveContext.cacheTable("NewBigTbl")   // cache it to speed up later queries
For further queries, use "NewBigTbl".
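For example, a couple of illustrative follow-ups (the group-by below is just a stand-in for whatever per-key query you actually run):

// Illustrative query against the cached, repartitioned temp table.
val perKeyCounts = hiveContext.sql("select KEY, count(*) as cnt from NewBigTbl group by KEY")
perKeyCounts.show()

// Sanity check: the cached table now has as many partitions as there are distinct keys.
println(hiveContext.table("NewBigTbl").rdd.partitions.length)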