Search code examples
dataframeapache-sparkpysparkrdd

How to set KryoSerializer in Pyspark?


i'm new to Pyspark Please help me with it:

spark = SparkSession.builder.appName("FlightDelayRDD").master("local[*]").getOrCreate()
sc = spark.sparkContext
sc.setSystemProperty("spark.dynamicAllocation.enabled", "true")
sc.setSystemProperty("spark.dynamicAllocation.initialExecutors", "6")
sc.setSystemProperty("spark.dynamicAllocation.minExecutors", "6")
sc.setSystemProperty("spark.dynamicAllocation.schedulerBacklogTimeout", "0.5s")
sc.setSystemProperty("spark.speculation", "true")

I want to set KryoSerializer in pyspark like i configured above.


Solution

  • From official docs:

    Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type.

    To set Kryo serializer:

    sc.setSystemProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    

    To check:

    spark.sparkContext.getConf().get("spark.serializer")
    
    #u'org.apache.spark.serializer.KryoSerializer'