I am trying to learn about the DataFrame alias method in PySpark. Here is what I observe.
Suppose I have a sample dataframe
t1_df = spark.createDataFrame([['a'], ['b']], 'c1: string')
t1_df.show()
+---+
| c1|
+---+
| a|
| b|
+---+
Now I have created its alias
t2_df = t1_df.alias('df1')
If I select the second dataframe's column from the first dataframe, it works fine:
t1_df.select(t2_df.c1).show()
+---+
| c1|
+---+
| a|
| b|
+---+
However, if I try the same via the alias name, it doesn't work:
from pyspark.sql.functions import col

t1_df.select(col('df1.c1'))
AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `df1`.`c1` cannot be resolved. Did you mean one of the following? [`c1`].;
'Project ['df1.c1]
+- LogicalRDD [c1#3311], false
Why is that? How does an alias work? I don't have any specific purpose here; I am just curious and experimenting.
I am using Spark version 3.4.1.
The df1 name is only recognizable when the aliased dataframe t2_df is part of the query. The alias itself doesn't create anything special; it is simply recorded as a qualifier on the column's attribute reference.
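For example (a minimal sketch continuing the snippets above), the qualified name resolves as soon as the query is built on the aliased dataframe itself:

from pyspark.sql.functions import col

t2_df.select(col('df1.c1')).show()  # works: t2_df's columns carry the 'df1' qualifier
t1_df.select(col('df1.c1'))         # fails: t1_df's columns have no 'df1' qualifier

This is also why aliases are mainly useful in self-joins, where the qualifiers disambiguate two copies of the same columns:

t2_df.join(t1_df.alias('df2'), col('df1.c1') == col('df2.c1')) \
     .select(col('df1.c1'), col('df2.c1')).show()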
To see that, we can switch to the Scala API and inspect the analyzed plan of the query:
t2_df.select("c1").queryExecution.analyzed.prettyJson
Output:
[ {
  "class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
  "num-children" : 1,
  "projectList" : [ [ {
    "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
    "num-children" : 0,
    "name" : "c1",
    "dataType" : { "type" : "array", "elementType" : "string", "containsNull" : true },
    "nullable" : true,
    "metadata" : { },
    "exprId" : {
      "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
      "id" : 6,
      "jvmId" : "3271baa8-0b0d-4e6c-bf89-f982cb40a636"
    },
    "qualifier" : "[df1]"  // This is where the alias is recorded
  } ] ],
  ...
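If you prefer to stay in Python, the same plan can be reached through the DataFrame's underlying JVM object. Note that _jdf is an internal, undocumented handle (and not available under Spark Connect), so treat this as a sketch rather than a stable API:

# Inspect the analyzed plan from PySpark via the internal JVM DataFrame
print(t2_df.select('c1')._jdf.queryExecution().analyzed().prettyJson())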