Tags: sql, apache-spark, apache-spark-sql, pivot-table, databricks

Spark SQL PIVOT - does not allow specifying columns in the SELECT clause


I'm using Spark SQL on Databricks, but this may apply to other SQL dialects as well.

I wonder why PIVOT only works when I select all columns:

SELECT *
FROM events
PIVOT (
  COUNT(*)
  FOR event_name IN ('cart', 'pillows', 'login', 'main', 'careers', 'guest', 'faq', 'down', 'warranty')
)

BUT stops working when I specify certain columns:

SELECT user_id, event_name
FROM events
PIVOT (
  COUNT(*)
  FOR event_name IN ('cart', 'pillows', 'login', 'main', 'careers', 'guest', 'faq', 'down', 'warranty')
)

The error message is quite verbose. Here is the first line, which says the 'event_name' column doesn't exist:

Error in SQL statement: AnalysisException: Column 'event_name' does not exist. Did you mean one of the following? [finalize, foam, guest, original, premium ...


Solution

  • The problem is that your query's output does not have an event_name column.

    The PIVOT clause creates a separate column for each value of event_name that you listed in this expression:

    FOR event_name IN ('cart', 'pillows', 'login', 'main', 'careers', 'guest', 'faq', 'down', 'warranty')
    

    If you select one (or more) of those columns, it should work.

    For example, try:

    SELECT user_id, cart, pillows
    FROM events
    PIVOT (
      COUNT(*)
      FOR event_name IN ('cart', 'pillows', 'login', 'main', 'careers', 'guest', 'faq', 'down', 'warranty')
    )
    

    You can check the official docs, which include a few more examples: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-pivot.html#examples
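
    A related pattern worth noting (a sketch, assuming the events table also contains columns beyond user_id and event_name): PIVOT implicitly groups by every column it does not reference, so you can pre-select just the columns you care about in a subquery to control the grouping before pivoting.

    -- Sketch: limit the input columns so the implicit grouping is only by user_id.
    -- Assumes the events table and column names from the question.
    SELECT user_id, cart, pillows
    FROM (SELECT user_id, event_name FROM events)
    PIVOT (
      COUNT(*)
      FOR event_name IN ('cart', 'pillows')
    )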