Search code examples
dataframeapache-sparkaws-glue

Spark Dataframe - Merge Nested Columns into one


I'm trying to merge a dataframe that has multiple nested struct columns into one single struct column. Here's a same schema

Before:

|- column_1
|- column_2
|- column_3
|- nested_details_1
   |- a
   |- b
   |- c
|- nested_details_2
   |- d
   |- e
   |- f

Desired after:

|- column_1
|- column_2
|- column_3
|- nested_details
   |- a
   |- b
   |- c
   |- d
   |- e
   |- f

Visually I can see what needs to be done, but is there a function with data frames to merge columns for me? If it makes any difference I am using AWS Glue Dynamic Dataframe but I can convert easily enough to a normal data frame.


Solution

  • pyspark

    I managed to merge 2 structs using the pyspark struct function like this:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import struct, col, expr
    
    spark = SparkSession.builder.getOrCreate()
    
    input_df = spark.sql("""
        SELECT 
            1 as column_1,
            2 as column_2,
            3 as column_3,
            named_struct("a", 1, "b", 2, "c", 3) as nested_details_1,
            named_struct("d", 4, "e", 5, "f", 6) as nested_details_2
    """)
    merged_df = input_df.select(
        col("column_1"),
        col("column_2"),
        col("column_3"),
        struct(col("nested_details_1.*"),col("nested_details_2.*")).alias("nested_details")
    )
    merged_df.printSchema()
    merged_df.show()
    merged_df.select(expr("typeof(nested_details)")).show(truncate=False)
    

    above prints

    root
     |-- column_1: integer (nullable = false)
     |-- column_2: integer (nullable = false)
     |-- column_3: integer (nullable = false)
     |-- nested_details: struct (nullable = false)
     |    |-- a: integer (nullable = false)
     |    |-- b: integer (nullable = false)
     |    |-- c: integer (nullable = false)
     |    |-- d: integer (nullable = false)
     |    |-- e: integer (nullable = false)
     |    |-- f: integer (nullable = false)
    
    +--------+--------+--------+------------------+
    |column_1|column_2|column_3|    nested_details|
    +--------+--------+--------+------------------+
    |       1|       2|       3|{1, 2, 3, 4, 5, 6}|
    +--------+--------+--------+------------------+
    
    +-------------------------------------------+
    |typeof(nested_details)                     |
    +-------------------------------------------+
    |struct<a:int,b:int,c:int,d:int,e:int,f:int>|
    +-------------------------------------------+
    

    Spark SQL

    same can be achieved with SQL also using SQL struct function like this:

    with input as (
    SELECT 
        1 as column_1,
        2 as column_2,
        3 as column_3,
        named_struct("a", 1, "b", 2, "c", 3) as nested_details_1,
        named_struct("d", 4, "e", 5, "f", 6) as nested_details_2 
    )
    SELECT 
        column_1,
        column_2, 
        column_3, 
        struct(nested_details_1.*, nested_details_2.*) as nested_details
    FROM input