Tags: apache-spark, pyspark, rdd

Understanding RDD in PySpark (from parallelize)


I am new to PySpark (and Spark, for that matter). I converted a Python list to an RDD:

name_list_json = [ '{"name": "k"}', '{"name": "b"}', '{"name": "c"}' ]
name_list_rdd = spark.sparkContext.parallelize(name_list_json)
print(name_list_rdd)

This prints out "ParallelCollectionRDD[2] at readRDDFromFile at PythonRDD.scala:262". Two questions here:

  1. What does the 2 in ParallelCollectionRDD[2] mean? Is that the number of partitions?

  2. Also, why does readRDDFromFile show up here? Is that because the Python list is saved to a file and then loaded from the file?


Solution

    1. No, the number in brackets is not the partition count: it is the RDD's unique ID within the SparkContext, a counter that increments each time an RDD is created (yours got ID 2 because two RDDs had already been created in the session). To see the actual number of partitions, call getNumPartitions(); parallelize() defaults to sc.defaultParallelism, and you can pass numSlices to parallelize() or call repartition() to change it. See the sketch after this list.
    2. readRDDFromFile shows up because your guess is right: PySpark's parallelize() serializes the Python list to a temporary file, and the JVM reads it back through PythonRDD.readRDDFromFile to build the RDD. Printing an RDD only shows this lazy description; to print the contents, you need to trigger an action such as collect() first.
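A minimal sketch tying these points together (assuming a local SparkSession; rdd.id(), getNumPartitions(), repartition(), and collect() are standard PySpark RDD methods):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

name_list_json = ['{"name": "k"}', '{"name": "b"}', '{"name": "c"}']
name_list_rdd = sc.parallelize(name_list_json)

print(name_list_rdd.id())                # the number shown in brackets, e.g. 2
print(name_list_rdd.getNumPartitions())  # actual partition count
print(name_list_rdd.repartition(4).getNumPartitions())  # new RDD with 4 partitions
print(name_list_rdd.collect())           # action: returns the list of elements

The exact ID you see depends on how many RDDs were created earlier in the session, and the default partition count depends on your master setting (e.g. the number of local cores).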