I have the following data structure in json that I am trying to get into a dataframe using AWS Glue:
{
"out": [
{
"attr": [ "a1", "a2", "a3" ],
"val": [ 1, 2, 3 ],
"text": "test1"
},
{
"attr": [ "a4", "a5", "a6" ],
"val": [ 4, 5, 6 ],
"text": "test2"
}
],
"ids": [
"id1",
"id2"
]
}
The "ids" field is a parallel array to the entries in "out". I've been trying to get the following:
id text attr val
-- ---- ---- ---
id1 test1 [a1, a2, a3] [1,2,3]
id2 test2 [a4, a5, a6] [4,5,6]
I've been able to split the ids and the content of "out" into two dataframes, but I cannot find a way to concatenate them horizontally.
Using
spark_context = SparkContext.getOrCreate()
glue_context = GlueContext(spark_context)
spark = glue_context.spark_session
print("Loading data...")
df = spark.read.json(<location>)
df.printSchema()
I get the following schema:
root
|-- out: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- attr: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- val: array (nullable = true)
| | | |-- element: double (containsNull = true)
| | |-- text: string (nullable = true)
|-- id: array (nullable = true)
| |-- element: string (containsNull = true)
With the dataframe api,
df = spark.read.json('test.json', multiLine=True)
df.select(f.explode(f.arrays_zip('ids', 'out')).alias('data')) \
.select('data.ids', 'data.out.*') \
.show(truncate=False)
+---+------------+-----+---------+
|ids|attr |text |val |
+---+------------+-----+---------+
|id1|[a1, a2, a3]|test1|[1, 2, 3]|
|id2|[a4, a5, a6]|test2|[4, 5, 6]|
+---+------------+-----+---------+