Tags: dataframe, apache-spark, pyspark, json, schema, key-value

Extract key value from dataframe in PySpark


I have the dataframe below, which I read from a JSON file.

1                             2                         3                          4
{"todo":["wakeup", "shower"]} {"todo":["brush", "eat"]} {"todo":["read", "write"]} {"todo":["sleep", "snooze"]}

I need my output to be key-value pairs, as shown below. How do I do this? Do I need to define a schema?

ID todo
1 wakeup, shower
2 brush, eat
3 read, write
4 sleep, snooze

Solution

  • The key-value structure you refer to is a struct: the "keys" are the struct field names, and the "values" are the field values.

    What you want to do is called unpivoting. One way to do it in PySpark is with the stack SQL function. The following is a dynamic approach that does not require hard-coding the column names.

    Input dataframe:

    df = spark.createDataFrame(
        [((['wakeup', 'shower'],),(['brush', 'eat'],),(['read', 'write'],),(['sleep', 'snooze'],))],
        '`1` struct<todo:array<string>>, `2` struct<todo:array<string>>, `3` struct<todo:array<string>>, `4` struct<todo:array<string>>')
    

    Script:

    to_melt = [f"'{c}', `{c}`.todo" for c in df.columns]
    df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) (ID, todo)")
    
    df.show()
    # +---+----------------+
    # | ID|            todo|
    # +---+----------------+
    # |  1|[wakeup, shower]|
    # |  2|    [brush, eat]|
    # |  3|   [read, write]|
    # |  4| [sleep, snooze]|
    # +---+----------------+