scala · apache-spark · dataframe · rdd

See information about the partitions of a Spark DataFrame


One can get the array of partitions of a Spark DataFrame as follows:

> df.rdd.partitions

Is there a way to get more information about partitions? In particular, I would like to see the partition key and the partition boundaries (first and last element within a partition).

This is just for better understanding of how the data is organized.

This is what I tried:

> df.rdd.partitions.head

But this object only exposes the methods equals, hashCode, and index.
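For reference, the Partition objects in that array indeed carry little more than their index; the partition count, at least, is available directly. A minimal sketch in local mode (the session setup and the example DataFrame are hypothetical, added only to make the snippet self-contained):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("partition-info")
  .getOrCreate()
import spark.implicits._

// Hypothetical example data; any DataFrame behaves the same way.
val df = (1 to 10).toDF("value").repartition(3)

// Each element of df.rdd.partitions is an org.apache.spark.Partition,
// which exposes little beyond its index:
df.rdd.partitions.foreach(p => println(p.index))

// The number of partitions is available without touching the data:
println(df.rdd.getNumPartitions)  // 3
```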


Solution

  • In case the data is not too large, one can write it to disk as follows:

    df.write.option("header", "true").csv("/tmp/foobar")
    

    The given directory must not exist. Spark writes one part-* file per non-empty partition, so the resulting files show how the rows are distributed across partitions.
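  • If writing to disk is not an option, the partition contents can also be inspected in memory. A sketch using mapPartitionsWithIndex to report, for each partition, its index, its size, and its first and last row (the session setup and example DataFrame are again hypothetical):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("partition-contents")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical example data; replace with the DataFrame of interest.
    val df = (1 to 100).toDF("value").repartition(4)

    // For each partition, stream through its rows once, keeping only the
    // first row, the last row, and a running count.
    val info = df.rdd.mapPartitionsWithIndex { (idx, it) =>
      if (it.hasNext) {
        val first = it.next()
        var last  = first
        var count = 1L
        while (it.hasNext) { last = it.next(); count += 1 }
        Iterator((idx, count, Some(first), Some(last)))
      } else {
        Iterator((idx, 0L, None, None))
      }
    }.collect()

    info.foreach { case (idx, count, first, last) =>
      println(s"partition $idx: count=$count first=$first last=$last")
    }
    ```

    Only one tuple per partition is collected to the driver, so this stays cheap even for large partitions. Note that after a plain repartition there is no meaningful partition key; first and last rows mark boundaries only when the DataFrame was range-partitioned or sorted.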