apache-spark, caching, pyspark

PySpark: RDD cache always serializes data


I have gone through multiple documents saying that the default behavior of cache/persist on a Spark RDD is to store the RDD as deserialized objects in JVM memory.

However, when I ran a test using a small sample file (5-6 lines), the Storage Level under the Storage tab in the Spark UI always shows as "Memory Serialized 1x Replicated".
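
Roughly the test I ran (a minimal sketch; the file path and session setup are placeholders for my actual environment):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-cache-test").getOrCreate()

    # Small sample file with 5-6 lines (placeholder path)
    rdd = spark.sparkContext.textFile("/tmp/small_sample.txt")

    # cache() uses MEMORY_ONLY, which I expected to mean deserialized objects in JVM memory
    rdd.cache()
    rdd.count()

    # Prints the same level that the Storage tab in the Spark UI reports
    print(rdd.getStorageLevel())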

Can anyone help me understand if I am missing anything here?


Solution

  • I did the same test as you; I prepared a small file:

    id;name;value
    1,test,full
    2,test,empty
    3,test,important
    4,test2,sadfdsf
    5,test4,gfdsfgdfg
    

    Then I started a Databricks Community 10.4 cluster with Spark 3.2.1 and Scala 2.12 and executed this code:

    //small files
    val rddWhole = spark.sparkContext.textFile("dbfs:/FileStore/shared_uploads/[email protected]/very_small_csv.csv") 
    rddWhole.cache().count()
    

    As a result, in the Spark UI I can see this:

    (Spark UI screenshot: Storage tab for the cached RDD)

    The same applies to a bigger file (the screenshot is from Spark 3.3, but the result is the same):

    (Spark UI screenshot: Storage tab for the larger file, from Spark 3.3)

    Is it exactly the same in your environment?

    Edit: I can confirm that, as stated in the comments, for Python it is serialized.

    (Spark UI screenshot from the Python test)
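
    A minimal PySpark version of the same test (the DBFS path is a placeholder mirroring the Scala example, and spark is the session provided by the Databricks notebook):

    # Same test as the Scala snippet above, but through the Python RDD API
    rdd_whole = spark.sparkContext.textFile("dbfs:/FileStore/shared_uploads/very_small_csv.csv")
    rdd_whole.cache()
    rdd_whole.count()

    # Unlike the Scala run, this reports a serialized level:
    # StorageLevel(False, True, False, False, 1)
    print(rdd_whole.getStorageLevel())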

    I checked the source code and I can see a difference between Scala and Python. For RDDs, both use the MEMORY_ONLY level for caching, but it is defined differently in Python than in Scala:

    Scala source code

    Python source code

    val MEMORY_ONLY = new StorageLevel(false, true, false, true)        // Scala
    StorageLevel.MEMORY_ONLY = StorageLevel(False, True, False, False)  # Python
    

    The last parameter is deserialized, so I think that is why the storage levels differ, but at this moment I am not sure of the reason behind it.
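
    To make the difference concrete, here is a small Python sketch that inspects the flags on the predefined level and builds the Scala-like variant by hand (scala_like_memory_only is just an illustrative name, not a predefined constant):

    from pyspark import StorageLevel

    # PySpark's MEMORY_ONLY: the fourth flag (deserialized) is False
    print(StorageLevel.MEMORY_ONLY)               # StorageLevel(False, True, False, False, 1)
    print(StorageLevel.MEMORY_ONLY.deserialized)  # False

    # The Scala definition corresponds to deserialized=True; from Python it
    # would have to be constructed manually, like this
    scala_like_memory_only = StorageLevel(False, True, False, True)
    print(scala_like_memory_only)                 # StorageLevel(False, True, False, True, 1)

    This is consistent with the note in the PySpark storage-level documentation that data in Python is always serialized with the Pickle library, so a deserialized in-memory level would not change what actually gets stored, which is presumably why MEMORY_ONLY is defined with deserialized set to False there.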