
Pyspark: explode json in column to multiple columns


The data looks like this -

+-----------+-----------+-----------------------------+
|         id|      point|                         data|
+-----------+-----------+-----------------------------+
|        abc|          6|{"key1":"124", "key2": "345"}|
|        dfl|          7|{"key1":"777", "key2": "888"}|
|        4bd|          6|{"key1":"111", "key2": "788"}|
+-----------+-----------+-----------------------------+

I am trying to break it into the following format.

+-----------+-----------+-----------+-----------+
|         id|      point|       key1|       key2|
+-----------+-----------+-----------+-----------+
|        abc|          6|        124|        345|
|        dfl|          7|        777|        888|
|        4bd|          6|        111|        788|
+-----------+-----------+-----------+-----------+

The explode function breaks the dataframe into multiple rows, not multiple columns, so that is not the desired solution.

Note: This solution does not answer my question: PySpark "explode" dict in column


Solution

  • As long as you are using Spark version 2.1 or higher, pyspark.sql.functions.from_json should get you your desired result, but you first need to define the schema of the JSON

    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType
    
    # Schema describing the JSON string stored in the "data" column
    schema = StructType(
        [
            StructField('key1', StringType(), True),
            StructField('key2', StringType(), True)
        ]
    )
    
    # Parse the JSON string into a struct, then flatten it with "data.*"
    df.withColumn("data", from_json("data", schema))\
        .select(col('id'), col('point'), col('data.*'))\
        .show()
    

    which should give you

    +---+-----+----+----+
    | id|point|key1|key2|
    +---+-----+----+----+
    |abc|    6| 124| 345|
    |dfl|    7| 777| 888|
    |4bd|    6| 111| 788|
    +---+-----+----+----+
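
  • If you would rather not write the schema by hand, pyspark.sql.functions.get_json_object extracts individual keys by JSONPath instead. A minimal self-contained sketch (the local SparkSession and the sample dataframe here are just for illustration; in your case df already exists):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import get_json_object

    spark = SparkSession.builder.master("local[1]").appName("json-to-cols").getOrCreate()

    # Recreate the sample data from the question
    df = spark.createDataFrame(
        [("abc", 6, '{"key1":"124", "key2": "345"}'),
         ("dfl", 7, '{"key1":"777", "key2": "888"}'),
         ("4bd", 6, '{"key1":"111", "key2": "788"}')],
        ["id", "point", "data"],
    )

    # Pull each key out of the JSON string with a JSONPath expression
    result = df.select(
        "id",
        "point",
        get_json_object("data", "$.key1").alias("key1"),
        get_json_object("data", "$.key2").alias("key2"),
    )
    result.show()
    ```

    This is convenient for one or two keys, but for many keys the from_json approach above is tidier since it parses the JSON only once.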