Tags: json, apache-spark, pyspark

How to read several JSON files with different column count into one Dataframe in Spark


I have two JSON files. One looks like this:

{
  "a":{
    "a1":"xxx"
  },
  "b":"xxx"
}

The other looks like this:

{
  "a":{
    "a1":"xxx",
    "a2":"xxx"
  },
  "b":"xxx"
}

I want to read both JSON files into a single DataFrame in Spark, even though the nested struct `a` has a different number of fields in each file. I tried `union` and `unionByName`, but they failed because the schemas don't match. How can I achieve this?


Solution

  • Spark can take care of merging the schemas for you: point a single read at both files, and the inferred schema will be the union of the fields, with missing fields filled in as null. See the following code:

    >>> spark.read.option("multiLine", True).json("test-jsons/*").printSchema()
    root
     |-- a: struct (nullable = true)
     |    |-- a1: string (nullable = true)
     |    |-- a2: string (nullable = true)
     |-- b: string (nullable = true)
    
    >>> spark.read.option("multiLine", True).json("test-jsons/*").show()
    +-----------+---+
    |          a|  b|
    +-----------+---+
    | {xxx, xxx}|xxx|
    |{xxx, NULL}|xxx|
    +-----------+---+