Tags: python, amazon-web-services, aws-glue

AWS Glue Dynamic Frame columns from array


I have a nested JSON structured as in the following example:

{'A': [{'key': 'B', 'value': 'C'}, {'key': 'D', 'value': 'E'}]}

Now I want to map this to the following schema:

|--A 
|--|--B
|--|--D

i.e. the structure recovered from a JSON like:

{'A':{'B':'C','D':'E'}}

The array in 'A' has no fixed number of entries, but the contained dicts always have exactly the two keys 'key' and 'value'.


Solution

  • Please find the script below.

    from pyspark.sql.functions import explode, create_map, collect_list
    
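    # The snippets below assume a DataFrame named `sample`; it can be
    # rebuilt from the question's JSON like this (assuming an active
    # SparkSession bound to `spark`):
    sample = spark.read.json(spark.sparkContext.parallelize(
        ['{"A": [{"key": "B", "value": "C"}, {"key": "D", "value": "E"}]}']))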
    >>> sample.printSchema()
    root
     |-- A: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- key: string (nullable = true)
     |    |    |-- value: string (nullable = true)
    
    
    >>> final_df = (sample
    ...             .select(explode('A').alias("A"))
    ...             .withColumn("A",create_map("A.key", "A.value"))
    ...             .groupby().agg(collect_list("A").alias("A"))
    ... )
    >>> final_df.printSchema()
    root
     |-- A: array (nullable = true)
     |    |-- element: map (containsNull = false)
     |    |    |-- key: string
     |    |    |-- value: string (valueContainsNull = true)
    
    >>> final_df.show(truncate=False)
    +--------------------+
    |A                   |
    +--------------------+
    |[[B -> C], [D -> E]]|
    +--------------------+
    
    >>> (final_df
    ...  .write
    ...  .format("json")
    ...  .mode("overwrite")
    ...  .save("sample_files/2020-09-29/out")
    ... )
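
  • Note that the script above yields an array of one-entry maps, so the JSON written to disk has the shape {"A": [{"B": "C"}, {"D": "E"}]}. If the single-object shape from the question ({'A': {'B': 'C', 'D': 'E'}}) is required, map_from_entries (available since Spark 2.4) turns the array of key/value structs into a single map without the explode/groupby round trip; a sketch along those lines, reusing the same sample DataFrame:

    ```python
    from pyspark.sql.functions import map_from_entries

    # Collapse the array of {key, value} structs into one map per row,
    # giving JSON of the shape {"A": {"B": "C", "D": "E"}} on write.
    single_map_df = sample.select(map_from_entries("A").alias("A"))
    ```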