Search code examples
scalaapache-sparkrdd

What does the number after the ShuffledRDD['number'] indicates?


I am trying to read a filepath from the hdfs path and doing some transformations on top and at last applying some custom partitioning on top of it.

Below is the Code :

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path} 
val extensionPattern: String = ".*.csv"
val numPartitions = 96
val fs = FileSystem.get(new Configuration())
val rdd = fs
      .listStatus(new Path(path))
      .filter(x => x.getPath.toString.matches("^.*/" + extensionPattern + "$"))
      .map(x => x.getPath.toString)
      .toList
      .zipWithIndex
      .map(_.swap)
      .toDS
      .rdd.partitionBy(new ExactPartitioner(numPartitions))

Every time I run this command I see a different value for the ShuffledRDD[] value. It keeps on increasing but I am not able to understand what is the purpose of it and why it is needed. Also does it affect the operations that I am doing after this lines of code.

You can see the output of the notebook cell as below :

enter image description here

The number that is highlighted in the image keeps on changing.


Solution

  • According to the toString definition of an RDD, the number corresponds to the ID of the RDD object (3rd argument in the format string):

    override def toString: String = "%s%s[%d] at %s".format(
        Option(name).map(_ + " ").getOrElse(""), getClass.getSimpleName, id, getCreationSite)
    

    The number shouldn't affect anything in the code. It keeps increasing every time you do an operation because a new RDD is created every time you run the line that defines val rdd.