Tags: sql, apache-spark, pivot, transform, union

Convert columns to rows in Spark SQL


I have some data like this:

ID   Value1  Value2  Value40
101  3       520     2001
102  29      530     2020

I want to take this data and convert it into key-value style pairs instead:

ID   ValueVv  ValueDesc
101  3        Value1
101  520      Value2
101  2001     Value40

I think it's a pivot, but I can't think of what this needs to look like in code.

I am trying to solve this in Spark SQL, but also with a PySpark DataFrame, as I am using Spark.

I could easily just union each column into an output using SQL, but I was hoping there is a more efficient way.

I've looked at melt and stack as options, but I'm unsure how to use them effectively.


Solution

  • It's the opposite of pivot - it's called unpivot.
    In Spark, unpivoting is implemented with the stack function.

    Using PySpark, this is what you could do if you didn't have many columns:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # The original wide data.
    df = spark.createDataFrame(
        [(101, 3, 520, 2001),
         (102, 29, 530, 2020)],
        ['ID', 'Value1', 'Value2', 'Value40'])

    # stack(n, expr1, label1, expr2, label2, ...) turns each input row into n output rows,
    # pairing every column value with its label.
    df = df.select(
        "ID",
        F.expr("stack(3, Value1, 'Value1', Value2, 'Value2', Value40, 'Value40') as (ValueVv, ValueDesc)")
    )
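
    Since you also asked about SQL: the same stack expression works directly in a Spark SQL
    query. A minimal sketch, assuming the original wide DataFrame (here called wide_df) is
    registered under the made-up view name input_data:

    # Register the wide DataFrame (the one built with createDataFrame above) as a temp view.
    wide_df.createOrReplaceTempView("input_data")

    unpivoted = spark.sql("""
        SELECT ID,
               stack(3,
                     Value1,  'Value1',
                     Value2,  'Value2',
                     Value40, 'Value40') AS (ValueVv, ValueDesc)
        FROM input_data
    """)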
    

    From your example I see that you may have lots of columns. In that case you can build the stack expression dynamically:

    # Build the "column, 'label'" pairs for every column except ID.
    cols_to_unpivot = [f"`{c}`, '{c}'" for c in df.columns if c != 'ID']
    stack_string = ", ".join(cols_to_unpivot)

    df = df.select(
        "ID",
        F.expr(f"stack({len(cols_to_unpivot)}, {stack_string}) as (ValueVv, ValueDesc)")
    )
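
    One thing to watch for: stack may complain about a data type mismatch if the unpivoted
    columns don't all share a compatible type, so if some of your Value columns are strings
    and others are numeric, cast them to a common type (e.g. string) before unpivoting.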
    

    For the example data, each of these versions returns

    +---+-------+---------+
    | ID|ValueVv|ValueDesc|
    +---+-------+---------+
    |101|      3|   Value1|
    |101|    520|   Value2|
    |101|   2001|  Value40|
    |102|     29|   Value1|
    |102|    530|   Value2|
    |102|   2020|  Value40|
    +---+-------+---------+
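
    Since you mentioned melt: if you are on Spark 3.4 or newer, there is also a built-in
    DataFrame.unpivot (with melt as an alias) that does the same thing without hand-building
    a stack expression. A sketch, assuming Spark 3.4+ and the original wide DataFrame wide_df:

    # Built-in unpivot (Spark 3.4+); melt() is an alias with the same signature.
    unpivoted = wide_df.unpivot(
        ids="ID",
        values=["Value1", "Value2", "Value40"],  # or None to unpivot every non-id column
        variableColumnName="ValueDesc",
        valueColumnName="ValueVv",
    )

    # unpivot puts the columns in (ID, ValueDesc, ValueVv) order; reorder if needed.
    unpivoted = unpivoted.select("ID", "ValueVv", "ValueDesc")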