I know how to find the file size in scala.But how to find a RDD/dataframe size in spark?
Scala:
object Main extends App {
val file = new java.io.File("hdfs://localhost:9000/samplefile.txt").toString()
println(file.length)
}
Spark:
val distFile = sc.textFile(file)
println(distFile.length)
but if i process it not getting file size. How to find the RDD size?
Yes Finally I got the solution. Include these libraries.
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd
How to find the RDD Size:
def calcRDDSize(rdd: RDD[String]): Long = {
rdd.map(_.getBytes("UTF-8").length.toLong)
.reduce(_+_) //add the sizes together
}
Function to find DataFrame size: (This function just convert DataFrame to RDD internally)
val dataFrame = sc.textFile(args(1)).toDF() // you can replace args(1) with any path
val rddOfDataframe = dataFrame.rdd.map(_.toString())
val size = calcRDDSize(rddOfDataframe)