I am trying to learn about the DataFrame alias method in PySpark. Here is what I observe.
Suppose I have a sample dataframe
t1_df = spark.createDataFrame([['a'], ['b']], 'c1: string')
t1_df.show()
+---+
| c1|
+---+
| a|
| b|
+---+
Now I have created its alias
t2_df = t1_df.alias('df1')
If I select the second dataframe's column from the first dataframe, it works fine:
t1_df.select(t2_df.c1).show()
+---+
| c1|
+---+
| a|
| b|
+---+
However, if I try the same via the alias name, it doesn't work:
from pyspark.sql.functions import col

t1_df.select(col('df1.c1'))
AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `df1`.`c1` cannot be resolved. Did you mean one of the following? [`c1`].;
'Project ['df1.c1]
+- LogicalRDD [c1#3311], false
Why is that? How does an alias work? I don't have any specific purpose here; I am just curious and experimenting.
I am using Spark version 3.4.1.
The df1 name is only recognizable when the aliased dataframe t2_df is part of the query. The alias itself doesn't create anything special; it is simply recorded as a qualifier on the column's attribute reference.
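For example (a minimal sketch continuing the snippets above), the qualified name resolves as soon as the query is built on the aliased dataframe itself:

from pyspark.sql.functions import col

t2_df.select(col('df1.c1')).show()  # works: t2_df's columns carry the 'df1' qualifier
t1_df.select(col('df1.c1'))         # fails: t1_df's columns have no 'df1' qualifier

This is also why aliases are mainly useful in self-joins, where the qualifiers disambiguate two copies of the same columns:

t2_df.join(t1_df.alias('df2'), col('df1.c1') == col('df2.c1')) \
     .select(col('df1.c1'), col('df2.c1')).show()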
To see that, we can switch to the Scala API and inspect the analyzed plan of the query:
t2_df.select("c1").queryExecution.analyzed.prettyJson
Output:
[ {
  "class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
  "num-children" : 1,
  "projectList" : [ [ {
    "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
    "num-children" : 0,
    "name" : "c1",
    "dataType" : { "type" : "array", "elementType" : "string", "containsNull" : true },
    "nullable" : true,
    "metadata" : { },
    "exprId" : {
      "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
      "id" : 6,
      "jvmId" : "3271baa8-0b0d-4e6c-bf89-f982cb40a636"
    },
    "qualifier" : "[df1]"  // This is where the alias is recorded
  } ] ],
  ...
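If you prefer to stay in Python, the same plan can be reached through the DataFrame's underlying JVM object. Note that _jdf is an internal, undocumented handle (and not available under Spark Connect), so treat this as a sketch rather than a stable API:

# Inspect the analyzed plan from PySpark via the internal JVM DataFrame
print(t2_df.select('c1')._jdf.queryExecution().analyzed().prettyJson())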