My JSON structure in S3 is shown below. I have successfully crawled it into Data Catalog tables and imported it into a DynamicFrame.
{
  "ColumnA": "Value",
  "ColumnB": [
    "Value"
  ],
  "ColumnC": "Value",
  "ColumnD": "Value"
}
Schema of the DynamicFrame:
root
|-- columnA: string
|-- columnB: array
|    |-- element: string
|-- columnC: string
|-- columnD: string
Although columnB is an array type, it only ever holds one value. I have no control over the source that generates these JSON files, so I have to work with this format.
I need to push this to a Redshift table with the following schema:
| ColumnA|ColumnB|ColumnC|ColumnD|
While columns A, C and D are fairly straightforward, how do I pull the first value out of the ColumnB array in the DynamicFrame so that I can write it to the Redshift table?
For Spark 2.4+:
Use the element_at function to get the first value from the array. It uses 1-based indexing, so index 1 returns the first element.
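# df is the DataFrame form of the DynamicFrame (e.g. obtained via toDF())
df.printSchema()
# root
#  |-- ColumnA: string (nullable = true)
#  |-- ColumnB: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- ColumnC: string (nullable = true)
#  |-- ColumnD: string (nullable = true)

from pyspark.sql.functions import col, element_at

df.withColumn("ColumnB", element_at(col("ColumnB"), 1)).show()
# |  value|  value|  value|  value|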
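For Spark < 2.4:
# Using .getItem(0); unlike element_at, the index is 0-based
df.withColumn("ColumnB", col("ColumnB").getItem(0)).show()
# |  value|  value|  value|  value|
# Using bracket indexing (also 0-based)
df.withColumn("ColumnB", col("ColumnB")[0]).show()
# |  value|  value|  value|  value|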