Tags: amazon-web-services, pyspark, aws-glue

Extract first value from array<string> in AWS Glue DynamicFrame


My JSON structure in S3 is as below. I have successfully crawled it into Data Catalog tables and imported it into a DynamicFrame.

{
    "ColumnA": "Value",
    "ColumnB": [
        "Value"
    ],
    "ColumnC": "Value",
    "ColumnD": "Value"
}

Schema of the DynamicFrame

root
|-- columnA: string
|-- columnB: array
|    |-- element: string
|-- columnC: string
|-- columnD: string

Although columnB is an array type, it only ever contains one value. I have no control over the source that generates these JSON files, so I have to work with this format.

I need to push this to a Redshift table which has the below schema.

+--------+-------+-------+-------+
| ColumnA|ColumnB|ColumnC|ColumnD|
+--------+-------+-------+-------+

While columns A, C, and D are fairly straightforward, how do I pull the first value from the 'ColumnB' array in the DynamicFrame so that I can write it to the Redshift table?


Solution

  • For Spark 2.4+:

    Use the element_at function to get the first value from the array.

    Example:

    df = spark.createDataFrame([("value", ["value"], "value", "value")], ["ColumnA", "ColumnB", "ColumnC", "ColumnD"])
    df.printSchema()
    #root
    # |-- ColumnA: string (nullable = true)
    # |-- ColumnB: array (nullable = true)
    # |    |-- element: string (containsNull = true)
    # |-- ColumnC: string (nullable = true)
    # |-- ColumnD: string (nullable = true)
    
    from pyspark.sql.functions import col, element_at
    
    df.withColumn("ColumnB", element_at(col("ColumnB"), 1)).show()
    #+-------+-------+-------+-------+
    #|ColumnA|ColumnB|ColumnC|ColumnD|
    #+-------+-------+-------+-------+
    #|  value|  value|  value|  value|
    #+-------+-------+-------+-------+
    

    For Spark < 2.4:

    # Using .getItem(0)
    
    df.withColumn("ColumnB", col("ColumnB").getItem(0)).show()
    #+-------+-------+-------+-------+
    #|ColumnA|ColumnB|ColumnC|ColumnD|
    #+-------+-------+-------+-------+
    #|  value|  value|  value|  value|
    #+-------+-------+-------+-------+
    
    # Using index notation
    df.withColumn("ColumnB", col("ColumnB")[0]).show()
    #+-------+-------+-------+-------+
    #|ColumnA|ColumnB|ColumnC|ColumnD|
    #+-------+-------+-------+-------+
    #|  value|  value|  value|  value|
    #+-------+-------+-------+-------+
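
    Since you are starting from a Glue DynamicFrame, you can convert it to a DataFrame, apply the transformation above, convert back, and then write to Redshift. Below is a minimal sketch, assuming `glueContext` is your GlueContext, `dyf` is the DynamicFrame you imported from the Data Catalog, and the Redshift connection name, table, database, and temp directory are placeholders for your own values.

    from awsglue.dynamicframe import DynamicFrame
    from pyspark.sql.functions import col, element_at
    
    # DynamicFrame -> Spark DataFrame
    df = dyf.toDF()
    
    # Replace the array column with its first element
    df = df.withColumn("ColumnB", element_at(col("ColumnB"), 1))
    
    # Spark DataFrame -> DynamicFrame
    dyf_flat = DynamicFrame.fromDF(df, glueContext, "dyf_flat")
    
    # Write to Redshift via a Glue JDBC connection (names below are assumptions)
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=dyf_flat,
        catalog_connection="my-redshift-connection",
        connection_options={"dbtable": "my_table", "database": "my_db"},
        redshift_tmp_dir="s3://my-temp-bucket/tmp/",
    )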