apache-spark, pyspark

What is the #<number> after a column name in Spark?


I don't have any specific purpose for understanding these names; I'm just curious about them.

Here is a minimal example that shows them:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([['a', 'b'], ['c', 'd']], 'c1: string, c2: string')
df2 = spark.createDataFrame([['a', 'p'], ['c', 'q']], 'c1: string, c3: string')
df1.join(df2, df1.c1 == df2.c1).explain()

It outputs

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [c1#0], [c1#4], Inner
   :- Sort [c1#0 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(c1#0, 200), ENSURE_REQUIREMENTS, [plan_id=191]
   :     +- Filter isnotnull(c1#0)
   :        +- Scan ExistingRDD[c1#0,c2#1]
   +- Sort [c1#4 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(c1#4, 200), ENSURE_REQUIREMENTS, [plan_id=192]
         +- Filter isnotnull(c1#4)
            +- Scan ExistingRDD[c1#4,c3#5]

The column names are followed by numbers, like c1#0 and c2#1. What are these numbers? One thing I can tell is that they help differentiate columns with the same name in different DataFrames, such as c1#0 and c1#4.

Any help is appreciated.


Solution

The number after `#` is the column's expression ID. Internally, Spark represents every column of a DataFrame as an `org.apache.spark.sql.catalyst.expressions.AttributeReference`, and each `AttributeReference` carries a unique `exprId` drawn from a monotonically increasing counter. Because the IDs are unique within the session, the planner can tell apart columns that share a name, such as `c1#0` from `df1` and `c1#4` from `df2`.
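As a toy sketch of the idea (not Spark's actual code, which hands out IDs from an `AtomicLong` in Catalyst's `NamedExpression` object), you can picture a process-wide counter assigning the next value to every new attribute; the `AttributeRef` class here is purely illustrative:

```python
import itertools

# Toy model of Catalyst's expression-ID generator: a process-wide,
# monotonically increasing counter. Every new attribute gets the next
# value, so IDs are unique across all DataFrames in a session.
_next_expr_id = itertools.count()

class AttributeRef:
    """Illustrative stand-in for Catalyst's AttributeReference."""
    def __init__(self, name):
        self.name = name
        self.expr_id = next(_next_expr_id)

    def __repr__(self):
        # Mirrors the name#exprId notation seen in explain() output.
        return f"{self.name}#{self.expr_id}"

# Two DataFrames with columns c1/c2 and c1/c3, as in the question:
df1_cols = [AttributeRef('c1'), AttributeRef('c2')]
df2_cols = [AttributeRef('c1'), AttributeRef('c3')]
print(df1_cols, df2_cols)  # [c1#0, c2#1] [c1#2, c3#3]
```

Even though both DataFrames have a column named `c1`, their IDs differ, which is exactly how the plan in the question keeps `c1#0` and `c1#4` apart.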