apache-spark, caching, pyspark

PySpark: RDD cache always serializes data


I have gone through multiple documents saying that the default behavior of cache/persist on a Spark RDD is to store the RDD as deserialized objects in JVM memory.

However, when I ran a test using a small sample file (5-6 lines), the Storage Level under the Storage tab in the Spark UI always shows as "Memory Serialized 1x Replicated".
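
Roughly the test I ran (a minimal sketch; the file path and session setup are placeholders for my actual environment):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-cache-test").getOrCreate()

    # Small sample file with 5-6 lines (placeholder path)
    rdd = spark.sparkContext.textFile("/tmp/small_sample.txt")

    # cache() uses MEMORY_ONLY, which I expected to mean deserialized objects in JVM memory
    rdd.cache()
    rdd.count()

    # Prints the same level that the Storage tab in the Spark UI reports
    print(rdd.getStorageLevel())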

Can anyone help me understand if I am missing anything here?


Solution

  • I did the same test as you; I prepared a small file:

    id;name;value
    1,test,full
    2,test,empty
    3,test,important
    4,test2,sadfdsf
    5,test4,gfdsfgdfg
    

    Then I started a Databricks Community 10.4 cluster with Spark 3.2.1 and Scala 2.12 and executed this code:

    //small files
    val rddWhole = spark.sparkContext.textFile("dbfs:/FileStore/shared_uploads/[email protected]/very_small_csv.csv") 
    rddWhole.cache().count()
    

    As a result, in the Spark UI I can see this:

    (Spark UI screenshot: Storage tab for the cached RDD)

    The same applies to a bigger file (the screenshot is from Spark 3.3, but the result is the same):

    (Spark UI screenshot: Storage tab for the larger file, from Spark 3.3)

    Is it exactly the same in your environment?

    Edit: I can confirm that, as stated in the comments, for Python it is serialized.

    (Spark UI screenshot from the Python test)
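
    A minimal PySpark version of the same test (the DBFS path is a placeholder mirroring the Scala example, and spark is the session provided by the Databricks notebook):

    # Same test as the Scala snippet above, but through the Python RDD API
    rdd_whole = spark.sparkContext.textFile("dbfs:/FileStore/shared_uploads/very_small_csv.csv")
    rdd_whole.cache()
    rdd_whole.count()

    # Unlike the Scala run, this reports a serialized level:
    # StorageLevel(False, True, False, False, 1)
    print(rdd_whole.getStorageLevel())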

    I checked the source code and I can see a difference between Scala and Python. For RDDs, both use the MEMORY_ONLY level for caching, but it is defined differently in Python than in Scala:

    Scala source code

    Python source code

    val MEMORY_ONLY = new StorageLevel(false, true, false, true)        // Scala
    StorageLevel.MEMORY_ONLY = StorageLevel(False, True, False, False)  # Python
    

    The last parameter is deserialized, so I think that is why the storage levels differ, but at this moment I am not sure of the reason behind it.
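
    To make the difference concrete, here is a small Python sketch that inspects the flags on the predefined level and builds the Scala-like variant by hand (scala_like_memory_only is just an illustrative name, not a predefined constant):

    from pyspark import StorageLevel

    # PySpark's MEMORY_ONLY: the fourth flag (deserialized) is False
    print(StorageLevel.MEMORY_ONLY)               # StorageLevel(False, True, False, False, 1)
    print(StorageLevel.MEMORY_ONLY.deserialized)  # False

    # The Scala definition corresponds to deserialized=True; from Python it
    # would have to be constructed manually, like this
    scala_like_memory_only = StorageLevel(False, True, False, True)
    print(scala_like_memory_only)                 # StorageLevel(False, True, False, True, 1)

    This is consistent with the note in the PySpark storage-level documentation that data in Python is always serialized with the Pickle library, so a deserialized in-memory level would not change what actually gets stored, which is presumably why MEMORY_ONLY is defined with deserialized set to False there.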