Search code examples
sqlapache-sparkpysparkapache-spark-sqlgrouping

Mismatched input 'x' expecting {<EOF>, ';'} when using GROUPING SETS


I am using Spark SQL to create grouping sets with total columns to reduce the size of a file which I have previously cubed.

My code is as follows

final_sum = sqlContext.sql('''
  SELECT case when GROUPING(x_flag)=1 then 'total' else x_flag END AS x_flag,
  ...
  from trans,
  group by x_flag,...,
  GROUPING SETS(
    (x_flag,y_name)
    ...
  )
''')

My column x_flag is indexed 0,1 (as integers) and I am trying to get grouping sets based on that column mainly.

I currently get the following:

mismatched input x_flag expecting {, ';'}(line 14, pos 11)

Is there a method to fix this without having to reflag the column as y and n or is there some error which I am missing with using integer in case when?


Solution

  • I could not reproduce the error, but I have created a working example for you.

    df = spark.createDataFrame(
        [(0, 0),
         (0, 1),
         (1, 0),
         (1, 1)],
        ['x_flag', 'y_name'])
    df.createOrReplaceTempView("trans")
    
    spark.sql("""
        SELECT
            CASE when GROUPING(x_flag)=1 then 'total' else x_flag END AS x_flag,
            CASE when GROUPING(y_name)=1 then 'total' else y_name END AS y_name,
            count(*)
        FROM trans
        GROUP BY
            GROUPING SETS(
                (x_flag, y_name),
                (x_flag),
                (y_name)
            )
    """).show()
    # +------+------+--------+
    # |x_flag|y_name|count(1)|
    # +------+------+--------+
    # |     0|     0|       1|
    # |     0| total|       2|
    # | total|     0|       2|
    # | total|     1|       2|
    # |     0|     1|       1|
    # |     1| total|       2|
    # |     1|     0|       1|
    # |     1|     1|       1|
    # +------+------+--------+
    

    Make sure you use correct syntax: in your question, I have noticed a comma after the from clause.
    Also, group by doesn't seem to require additional arguments when using grouping sets.