scala, apache-spark, rdd

How to find spark RDD/Dataframe size?


I know how to find a file's size in Scala, but how do I find the size of an RDD/DataFrame in Spark?

Scala:

object Main extends App {
  // java.io.File works for local paths (it cannot read hdfs:// URIs)
  val file = new java.io.File("samplefile.txt")
  println(file.length) // file size in bytes
}

Spark:

val distFile = sc.textFile(file)
println(distFile.count()) // count() gives the number of lines, not the size in bytes

But when I process it like this I don't get the file size. How do I find the size of an RDD?


Solution

  • Yes, finally I got the solution. Include this import:

    import org.apache.spark.rdd.RDD
    

    How to find the RDD size:

    def calcRDDSize(rdd: RDD[String]): Long = {
      rdd.map(_.getBytes("UTF-8").length.toLong)
         .fold(0L)(_ + _) // add the sizes together; fold also handles an empty RDD
    }
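
    For example, calling it on the file from the question (assuming sc is your SparkContext):

    val lines = sc.textFile("hdfs://localhost:9000/samplefile.txt")
    println(s"RDD size in bytes: ${calcRDDSize(lines)}")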
    

    Function to find the DataFrame size (it just converts the DataFrame to an RDD of strings first):

    // toDF() needs the SQL implicits in scope, e.g. import sqlContext.implicits._
    val dataFrame = sc.textFile(args(1)).toDF() // you can replace args(1) with any path

    val rddOfDataframe = dataFrame.rdd.map(_.toString())

    val size = calcRDDSize(rddOfDataframe)
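
    Note that Row.toString() measures the rendered text, so the result is only an approximation of the real data size. If what you actually want is the in-memory footprint, here is a minimal sketch using Spark's built-in org.apache.spark.util.SizeEstimator (it estimates JVM object sizes, object overhead included, so expect larger numbers than the raw byte count):

    import org.apache.spark.util.SizeEstimator

    // estimate the heap size of each Row object and add the estimates up
    val inMemoryBytes = dataFrame.rdd
      .map(row => SizeEstimator.estimate(row))
      .fold(0L)(_ + _)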