Search code examples
jsonapache-sparkpysparkapache-spark-sqlpyspark-schema

Read a nested json string and explode into multiple columns in pyspark


I want to parse a JSON request and create multiple columns out of it in pyspark as follows:

{
  "ID": "abc123",
  "device": "mobile",
  "Ads": [
    {
      "placement": "topright",
      "Adlist": [
        {
          "name": "ad1",
          "subtype": "placeholder1",
          "category": "socialmedia",
        },
        {
          "name": "ad2",
          "subtype": "placeholder2",
          "category": "media",
        },
        {
          "name": "ad3",
          "subtype": "placeholder3",
          "category": "printingpress",
        }
      ]
    },
    {
      "Placement": "bottomleft",
      "Adlist": [
        {
          "name": "ad4",
          "subtype": "placeholder4",
          "category": "socialmedia",
        },
        {
          "name": "ad5",
          "subtype": "placeholder5",
          "category": "media",
        },
        {
          "name": "ad6",
          "subtype": "placeholder6",
          "category": "printingpress",
        }
      ]
    }
  ]
}

I tried the following:

from pyspark import SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

rdd = sc.parallelize([sample_json])
df = spark.read.option('multiline', "true").json(rdd)

df.printSchema()
root
 |-- Ads: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Adlist: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- category: string (nullable = true)
 |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |-- subtype: string (nullable = true)
 |    |    |-- placement: string (nullable = true)
 |-- ID: string (nullable = true)
 |-- device: string (nullable = true)

The resulting dataframe looks like this:

+--------------------+------+------+
|                 Ads|    ID|device|
+--------------------+------+------+
|[{[{socialmedia, ...|abc123|mobile|
+--------------------+------+------+

However, the output I am looking for is as follows:

Id    | device | placement | name  | subtype      | category      |
-------------------------------------------------------------------
abc123| mobile | topright  | ad1   | placeholder1 | socialmedia   |
abc123| mobile | topright  | ad2   | placeholder2 | media         |
abc123| mobile | topright  | ad3   | placeholder3 | printingpress |
abc123| mobile |bottomleft | ad4   | placeholder4 | socialmedia   |
abc123| mobile |bottomleft | ad5   | placeholder5 | media         |
abc123| mobile |bottomleft | ad6   | placeholder6 | printingpress |

Solution

  • A combination of explode and ".*" will do the trick:

    df = df.select(col("Id"), col("device"), explode(col("Ads")).alias("Ads")) \
        .select("Id", "device", "Ads.*") \
        .select(col("Id"), col("device"), explode(col("Adlist")).alias("Adlist"), col("placement")) \
        .select("Id", "device", "placement", "Adlist.*")
    df.show()
    
    +------+------+---------+-------------+----+------------+
    |    Id|device|placement|     category|name|     subtype|
    +------+------+---------+-------------+----+------------+
    |abc123|mobile| topright|  socialmedia| ad1|placeholder1|
    |abc123|mobile| topright|        media| ad2|placeholder2|
    |abc123|mobile| topright|printingpress| ad3|placeholder3|
    |abc123|mobile|     null|  socialmedia| ad4|placeholder4|
    |abc123|mobile|     null|        media| ad5|placeholder5|
    |abc123|mobile|     null|printingpress| ad6|placeholder6|
    +------+------+---------+-------------+----+------------+
    

    FYI:

    • explode() is a function that is used to transform a column of array into multiple rows
    • ".*" is used to tranform a struct column into columns of fields of that struxt