I have gone through multiple documents saying that the default behavior of cache/persist on a Spark RDD is to store the RDD as deserialized objects in JVM memory.
However, when I ran some tests using a sample file (5-6 lines), the Storage Level under the Storage section of the Spark UI always shows as Memory Serialized 1x Replicated.
Can anyone help me understand if I am missing anything here?
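For reference, the documentation describes cache() on an RDD as shorthand for persist() with the default MEMORY_ONLY level. A minimal sketch of the kind of test I ran (the path and file are illustrative, not my actual setup):

import org.apache.spark.storage.StorageLevel

val rdd = spark.sparkContext.textFile("/tmp/sample.txt") // illustrative path
rdd.persist(StorageLevel.MEMORY_ONLY) // the same level cache() assigns by default
rdd.count()                           // an action is needed to materialize the cached partitions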
I did the same test as you. I prepared a small file:
id,name,value
1,test,full
2,test,empty
3,test,important
4,test2,sadfdsf
5,test4,gfdsfgdfg
Then I started a Databricks Community cluster on runtime 10.4 with Spark 3.2.1 and Scala 2.12, and executed this code:
// small file: cache() defaults to MEMORY_ONLY; count() materializes the cached partitions
val rddWhole = spark.sparkContext.textFile("dbfs:/FileStore/shared_uploads/[email protected]/very_small_csv.csv")
rddWhole.cache().count()
As a result, in the Spark UI I can see the storage level reported as Memory Deserialized 1x Replicated:
The same applies to a bigger file (the screenshot is from Spark 3.3, but the result is the same).
Is it exactly the same in your environment?
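You can also read the level back programmatically instead of relying on the UI. A quick check using the same rddWhole as above (the printed strings are what I would expect for the Scala level; the exact format may vary by version):

println(rddWhole.getStorageLevel)             // e.g. StorageLevel(memory, deserialized, 1 replicas)
println(rddWhole.getStorageLevel.description) // e.g. Memory Deserialized 1x Replicated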
Edit: I can confirm that, as stated in the comments, for Python it is serialized.
I checked the source code and I can see a difference between Scala and Python. For RDDs, both use the MEMORY_ONLY level for caching, but it is defined differently in Python than in Scala:
// constructor flags are (useDisk, useMemory, useOffHeap, deserialized)
val MEMORY_ONLY = new StorageLevel(false, true, false, true)        // Scala
StorageLevel.MEMORY_ONLY = StorageLevel(False, True, False, False)  # Python
The last parameter is deserialized, so I think that is why the two languages behave differently, but at the moment I am not sure what the reason for this design is.
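My guess (not verified against the source) is that PySpark rows arrive in the JVM already pickled into byte arrays, so a "deserialized" in-memory level would not mean anything for a Python RDD. In Scala you can reproduce the Python-style behavior explicitly by choosing a serialized level; a sketch (the path is illustrative, and MEMORY_ONLY_SER is a Scala-side level):

import org.apache.spark.storage.StorageLevel

val rddSer = spark.sparkContext.textFile("/tmp/very_small_csv.csv") // illustrative path
// deserialized = false here, matching what Python's MEMORY_ONLY declares
rddSer.persist(StorageLevel.MEMORY_ONLY_SER)
rddSer.count()
println(rddSer.getStorageLevel.description) // expected: Memory Serialized 1x Replicated

With this, the Storage tab should show the same "Memory Serialized 1x Replicated" label the question describes.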