I am trying to read a filepath from the hdfs path and doing some transformations on top and at last applying some custom partitioning on top of it.
Below is the Code :
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val extensionPattern: String = ".*.csv"
val numPartitions = 96
val fs = FileSystem.get(new Configuration())
val rdd = fs
.listStatus(new Path(path))
.filter(x => x.getPath.toString.matches("^.*/" + extensionPattern + "$"))
.map(x => x.getPath.toString)
.toList
.zipWithIndex
.map(_.swap)
.toDS
.rdd.partitionBy(new ExactPartitioner(numPartitions))
Every time I run this command I see a different value for the ShuffledRDD[] value. It keeps on increasing but I am not able to understand what is the purpose of it and why it is needed. Also does it affect the operations that I am doing after this lines of code.
You can see the output of the notebook cell as below :
The number that is highlighted in the image keeps on changing.
According to the toString
definition of an RDD, the number corresponds to the ID of the RDD object (3rd argument in the format string):
override def toString: String = "%s%s[%d] at %s".format(
Option(name).map(_ + " ").getOrElse(""), getClass.getSimpleName, id, getCreationSite)
The number shouldn't affect anything in the code. It keeps increasing every time you do an operation because a new RDD is created every time you run the line that defines val rdd
.