Tags: pyspark, aws-glue

Combining "parallel arrays" into Dataframe structure


I have the following JSON structure that I am trying to load into a DataFrame using AWS Glue:

{
    "out": [
        {
            "attr": [ "a1", "a2", "a3" ],
            "val": [ 1, 2, 3 ],
            "text": "test1"
        },
        {
            "attr": [ "a4", "a5", "a6" ],
            "val": [ 4, 5, 6 ],
            "text": "test2"
        }
        
    ],
    "ids": [
        "id1",
        "id2"
    ]    
}

The "ids" field is a parallel array to the entries in "out". I've been trying to get the following:

id     text     attr            val
--     ----     ----            ---
id1    test1    [a1, a2, a3]    [1,2,3]
id2    test2    [a4, a5, a6]    [4,5,6]

I've been able to split the ids and the content of "out" into two dataframes, but I cannot find a way to concatenate them horizontally.
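For reference, the pairing being asked for is just a positional zip of the two top-level arrays. A minimal plain-Python sketch of the target rows (outside Spark, with the data inlined from the question):

```python
import json

doc = json.loads("""
{
  "out": [
    {"attr": ["a1", "a2", "a3"], "val": [1, 2, 3], "text": "test1"},
    {"attr": ["a4", "a5", "a6"], "val": [4, 5, 6], "text": "test2"}
  ],
  "ids": ["id1", "id2"]
}
""")

# Pair each id with the "out" entry at the same position
rows = [
    {"id": i, "text": o["text"], "attr": o["attr"], "val": o["val"]}
    for i, o in zip(doc["ids"], doc["out"])
]
print(rows[0])  # {'id': 'id1', 'text': 'test1', 'attr': ['a1', 'a2', 'a3'], 'val': [1, 2, 3]}
```

This positional pairing is exactly what needs to happen inside Spark, rather than a column-wise concatenation of two separate DataFrames.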

Using

from pyspark import SparkContext
from awsglue.context import GlueContext

spark_context = SparkContext.getOrCreate()
glue_context = GlueContext(spark_context)
spark = glue_context.spark_session

print("Loading data...")
df = spark.read.json(<location>)
df.printSchema()

I get the following schema:

root
 |-- out: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- attr: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- val: array (nullable = true)
 |    |    |    |-- element: double (containsNull = true)
 |    |    |-- text: string (nullable = true)
 |-- ids: array (nullable = true)
 |    |-- element: string (containsNull = true)

Solution

  • With the DataFrame API (note that arrays_zip requires Spark 2.4+, and f below is pyspark.sql.functions):

    import pyspark.sql.functions as f

    df = spark.read.json('test.json', multiLine=True)
    
    df.select(f.explode(f.arrays_zip('ids', 'out')).alias('data')) \
      .select('data.ids', 'data.out.*') \
      .show(truncate=False)
    
    +---+------------+-----+---------+
    |ids|attr        |text |val      |
    +---+------------+-----+---------+
    |id1|[a1, a2, a3]|test1|[1, 2, 3]|
    |id2|[a4, a5, a6]|test2|[4, 5, 6]|
    +---+------------+-----+---------+
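If the attr/val arrays also need to be unpacked into one row per attribute, the same arrays_zip-then-explode pattern can be applied a second time to the result (roughly f.explode(f.arrays_zip('attr', 'val')), untested here). The equivalent pairing in plain Python, starting from the rows shown in the table above:

```python
# Rows as produced by the arrays_zip/explode step above
result = [
    {"ids": "id1", "text": "test1", "attr": ["a1", "a2", "a3"], "val": [1, 2, 3]},
    {"ids": "id2", "text": "test2", "attr": ["a4", "a5", "a6"], "val": [4, 5, 6]},
]

# Zip attr with val inside each row: one (attr, val) pair per output row
long_rows = [
    {"ids": r["ids"], "text": r["text"], "attr": a, "val": v}
    for r in result
    for a, v in zip(r["attr"], r["val"])
]
print(long_rows[0])  # {'ids': 'id1', 'text': 'test1', 'attr': 'a1', 'val': 1}
```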