Tags: sql, apache-spark, apache-spark-sql, pivot-table, databricks

Spark SQL PIVOT - does not allow specifying columns in the SELECT clause


I'm using Spark SQL on Databricks, but this may apply to other SQL dialects as well.

I wonder why PIVOT only works when I select all columns:

SELECT *
FROM events
PIVOT (
  COUNT(*)
  FOR event_name IN ('cart', 'pillows', 'login', 'main', 'careers', 'guest', 'faq', 'down', 'warranty')
)

BUT stops working when I specify certain columns:

SELECT user_id, event_name
FROM events
PIVOT (
  COUNT(*)
  FOR event_name IN ('cart', 'pillows', 'login', 'main', 'careers', 'guest', 'faq', 'down', 'warranty')
)

The error message is quite verbose. Here is the first line, which says the 'event_name' column doesn't exist:

Error in SQL statement: AnalysisException: Column 'event_name' does not exist. Did you mean one of the following? [finalize, foam, guest, original, premium ...


Solution

  • The problem is that your query's output does not have an event_name column.

    The PIVOT clause creates a separate column for each value of event_name that you listed in this expression:

    FOR event_name IN ('cart', 'pillows', 'login', 'main', 'careers', 'guest', 'faq', 'down', 'warranty')
    

    If you select one (or more) of those columns, it should work.

    For example, try:

    SELECT user_id, cart, pillows
    FROM events
    PIVOT (
      COUNT(*)
      FOR event_name IN ('cart', 'pillows', 'login', 'main', 'careers', 'guest', 'faq', 'down', 'warranty')
    )
    

    You can check the official docs, which include a few more examples: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-pivot.html#examples
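
    A related pattern worth noting (a sketch, assuming the events table also contains columns beyond user_id and event_name): PIVOT implicitly groups by every column it does not reference, so you can pre-select just the columns you care about in a subquery to control the grouping before pivoting.

    -- Sketch: limit the input columns so the implicit grouping is only by user_id.
    -- Assumes the events table and column names from the question.
    SELECT user_id, cart, pillows
    FROM (SELECT user_id, event_name FROM events)
    PIVOT (
      COUNT(*)
      FOR event_name IN ('cart', 'pillows')
    )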