I have some data like this:
ID | Value1 | Value2 | Value40 |
---|---|---|---|
101 | 3 | 520 | 2001 |
102 | 29 | 530 | 2020 |
I want to take this data and convert it into a key-value (KV) style layout instead:
ID | ValueVv | ValueDesc |
---|---|---|
101 | 3 | Value1 |
101 | 520 | Value2 |
101 | 2001 | Value40 |
I think it's a pivot, but I can't think of what this needs to look like in code.
I am trying to solve this in Spark SQL but also with a Python DataFrame, as I am using Spark.
I could easily just union each column into an output using SQL, but I was hoping there is a more efficient way.
I've looked at melt and stack as options, but I'm unsure how to use them effectively.
It's the opposite of pivot: it's called unpivot. In Spark, unpivoting is implemented with the stack function.
Using PySpark, this is what you could do if you didn't have many columns:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(101, 3, 520, 2001),
     (102, 29, 530, 2020)],
    ['ID', 'Value1', 'Value2', 'Value40'])

# stack(3, ...) emits one output row per (value, label) pair
df = df.select(
    "ID",
    F.expr("stack(3, Value1, 'Value1', Value2, 'Value2', Value40, 'Value40') as (ValueVv, ValueDesc)")
)
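Since you mentioned wanting a SQL solution as well: the same stack call works directly in Spark SQL. A minimal sketch, assuming the DataFrame is registered under the hypothetical view name input_data:

df.createOrReplaceTempView("input_data")  # hypothetical view name
df_kv = spark.sql("""
    SELECT ID,
           stack(3, Value1, 'Value1', Value2, 'Value2', Value40, 'Value40')
             AS (ValueVv, ValueDesc)
    FROM input_data
""")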
From your example I see that you may have lots of columns. In that case you can build the stack expression dynamically:
# Build a "`colName`, 'colName'" pair for every column except ID;
# the backticks protect column names containing special characters
cols_to_unpivot = [f"`{c}`, '{c}'" for c in df.columns if c != 'ID']
stack_string = ", ".join(cols_to_unpivot)

df = df.select(
    "ID",
    F.expr(f"stack({len(cols_to_unpivot)}, {stack_string}) as (ValueVv, ValueDesc)")
)
For the example data, both versions return:
+---+-------+---------+
| ID|ValueVv|ValueDesc|
+---+-------+---------+
|101| 3| Value1|
|101| 520| Value2|
|101| 2001| Value40|
|102| 29| Value1|
|102| 530| Value2|
|102| 2020| Value40|
+---+-------+---------+
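As a footnote on the melt idea from the question: if you are on Spark 3.4 or later, there is also a built-in DataFrame.unpivot (with melt as an alias) that removes the need to assemble the stack expression by hand. A minimal sketch; note that unpivot requires the unpivoted columns to share a common data type:

# DataFrame.unpivot(ids, values, variableColumnName, valueColumnName), Spark 3.4+
value_cols = [c for c in df.columns if c != 'ID']
df_kv = (df.unpivot('ID', value_cols, 'ValueDesc', 'ValueVv')
           .select('ID', 'ValueVv', 'ValueDesc'))  # match the column order above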