Different behaviour of same query in Spark 2.3 vs Spark 3.2

I am running a simple query on two versions of Spark, 2.3 and 3.2. The code is below:

spark-shell --master yarn --deploy-mode client
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols.head, op_cols.tail: _*)
df2.select("id").show()

In Spark 2.3 it returns:

| id |
| 1  |
| 1  |

But in Spark 3.2 it throws:

org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id.;
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:97)

I was expecting both versions to return the same result, or at least a configuration option that makes the behavior consistent. The settings I tried don't change the behavior.


On top of this, when both column names use the same case, it works:

val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols.head, op_cols.tail: _*)
df2.select("id").show()

Further analysis shows that this behavior was introduced in 2.4: the same query also fails in Spark 2.4.


  • The error was introduced in Spark 2.4, when new code was added to the attribute-resolution path. In Spark 2.3, distinct was applied to the candidates, but the later code works only with candidates/prunedCandidates and does not apply distinct. Once distinct is added while resolving attributes for the plan, the behavior is the same as in 2.3.

    The PR for this fix has been merged into the Spark 3.4 branch.
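
To make the role of distinct concrete, here is a minimal plain-Scala sketch of the idea. It is a toy model, not Spark's actual resolver: `Attr`, `candidates`, and `resolve` are hypothetical names invented for illustration, and relabelling each match with the requested name is an assumption of the sketch (chosen so that two matches pointing at the same expression id compare equal).

```scala
// Toy model only: these types and functions are hypothetical, not Spark's API.
object ResolveSketch {
  // An attribute is identified by its expression id; the name is just a label.
  final case class Attr(name: String, exprId: Long)

  // Case-insensitive lookup. Each match is relabelled with the requested name
  // (an assumption of this sketch), so matches for the same exprId are equal.
  def candidates(output: Seq[Attr], requested: String): Seq[Attr] =
    output.collect {
      case a if a.name.equalsIgnoreCase(requested) => a.copy(name = requested)
    }

  // Exactly one candidate resolves; more than one is reported as ambiguous.
  // With dedupe = true, identical candidates are collapsed first.
  def resolve(output: Seq[Attr], requested: String, dedupe: Boolean): Either[String, Attr] = {
    val found = candidates(output, requested)
    val pruned = if (dedupe) found.distinct else found
    pruned match {
      case Seq(single) => Right(single)
      case Seq()       => Left(s"cannot resolve '$requested'")
      case many =>
        Left(s"Reference '$requested' is ambiguous, could be: ${many.map(_.name).mkString(", ")}.")
    }
  }

  def main(args: Array[String]): Unit = {
    // Models the output of select("id", ..., "ID"): both labels refer to the same attribute.
    val output = Seq(Attr("id", 1L), Attr("col2", 2L), Attr("ID", 1L))
    println(resolve(output, "id", dedupe = false)) // ambiguous: two equal candidates
    println(resolve(output, "id", dedupe = true))  // resolves to the single attribute
  }
}
```

In this model the two candidates are the same attribute reached through two labels, so deduplicating them before checking for ambiguity changes the outcome from an error to a successful resolution, which mirrors the difference described between 2.3 and the later versions.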