Tags: pyspark, amazon-emr

EMR PySpark does not see computed columns when running select statements


I have a rather strange issue in a managed PySpark environment hosted on EMR 6.10.1.

When running this query:

spark.sql("select 1 as a, a+a as b, b+b as d").show()

On my local machine, on Databricks, and on any other PySpark instance I get the expected results. However, when I run the same query on the EMR cluster I get pyspark.sql.utils.AnalysisException: Column 'a' does not exist. Did you mean one of the following? []
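For reference, the Spark version on each environment can be printed from the same session:

    # Print the Spark version of the current session, to compare the
    # local/Databricks environments with the EMR cluster
    print(spark.version)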

Does anyone know which setting is causing this sort of behavior?


Solution

  • This feature is called lateral column alias references, and it was introduced in Spark 3.4. EMR 6.10 ships Spark 3.3, which is why it raises the exception.
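  • Until the cluster runs Spark 3.4+, a workaround is to rewrite the query so every alias is defined before it is referenced, e.g. with nested subqueries. A minimal sketch, assuming the same spark session as in the question (if I remember correctly, in 3.4+ the feature is gated by spark.sql.lateralColumnAlias.enableImplicitResolution, which is on by default, so there is no setting to flip on 3.3):

        # Spark 3.3-compatible rewrite: each alias is introduced in an
        # inner query before the outer query references it, so no
        # lateral column alias resolution is needed.
        spark.sql("""
            select a, b, b + b as d
            from (
                select a, a + a as b
                from (select 1 as a)
            )
        """).show()

    This returns a=1, b=2, d=4, the same result the original query produces on Spark 3.4+.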