Search code examples
pysparkapache-spark-sqlspark3

Pyspark SelectExp() not working for first() and last()


i have 2 statements which are to my knowledge exactly alike, but select() works fine, but selectExpr() generates following results.

+-----------------------+----------------------+
|first(StockCode, false)|last(StockCode, false)|
+-----------------------+----------------------+
|                 85123A|                 22138|
+-----------------------+----------------------+

+-----------+----------+
|first_value|last_value|
+-----------+----------+
|  StockCode| StockCode|
+-----------+----------+

following is implementation.

df.select(first(col("StockCode")), last(col("StockCode"))).show()
df.selectExpr("""first('StockCode') as first_value""", """last('StockCode') as last_value""").show()

Can any 1 explain the behaviour?


Solution

  • selectExpr takes everything as select clause in sql. Hence if you write anything in single quote', it will act as string in sql. if you wanted to pass the column to selectExpr use backtique (`) as below-

    df.selectExpr("""first(`StockCode`) as first_value""", """last(`StockCode`) as last_value""").show()
    

    backtique will help you to escape space in the column.

    you can use without backtique also if your column name is not starting with number like 12col or it doesn't have spaces in between like column name

    df.selectExpr("""first(StockCode) as first_value""", """last(StockCode) as last_value""").show()