I have the dataframe below, which I read from a JSON file.
1 | 2 | 3 | 4 |
---|---|---|---|
{"todo":["wakeup", "shower"]} | {"todo":["brush", "eat"]} | {"todo":["read", "write"]} | {"todo":["sleep", "snooze"]} |
I need my output to be key and value pairs, as below. How do I do this? Do I need to create a schema?
ID | todo |
---|---|
1 | wakeup, shower |
2 | brush, eat |
3 | read, write |
4 | sleep, snooze |
The key-value structure you refer to is a struct: the "keys" are struct field names, while the "values" are field values.
What you want to do is called unpivoting (also known as melting). One way to do it in PySpark is with stack. The following is a dynamic approach, where you don't need to hard-code the existing column names.
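Unpivoting simply turns each (column name, column value) pair into its own row. A plain-Python sketch of the idea (a dict standing in for the wide row is an assumption for illustration, not Spark API):

```python
# One wide "row": column name -> struct with a todo list
wide = {
    "1": {"todo": ["wakeup", "shower"]},
    "2": {"todo": ["brush", "eat"]},
}

# Unpivot: each column becomes its own (ID, todo) row
long_rows = [(col, struct["todo"]) for col, struct in wide.items()]
# long_rows == [("1", ["wakeup", "shower"]), ("2", ["brush", "eat"])]
```

stack() does the same reshaping, but as a single SQL expression evaluated per row.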
Input dataframe:
df = spark.createDataFrame(
[((['wakeup', 'shower'],),(['brush', 'eat'],),(['read', 'write'],),(['sleep', 'snooze'],))],
'`1` struct<todo:array<string>>, `2` struct<todo:array<string>>, `3` struct<todo:array<string>>, `4` struct<todo:array<string>>')
Script:
# Build "label, value" pairs for stack(): 'col_name', `col_name`.todo
to_melt = [f"'{c}', `{c}`.todo" for c in df.columns]
# stack(n, k1, v1, ..., kn, vn) unpivots n pairs into n rows
df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) (ID, todo)")
df.show()
# +---+----------------+
# | ID| todo|
# +---+----------------+
# | 1|[wakeup, shower]|
# | 2| [brush, eat]|
# | 3| [read, write]|
# | 4| [sleep, snooze]|
# +---+----------------+