Tags: apache-spark, pyspark, apache-spark-sql

Add a column to multilevel nested structure in pyspark


I have a PySpark DataFrame with the following structure.

Current Schema:

root
 |-- ID
 |-- Information
 |   |-- Name
 |   |-- Age
 |   |-- Gender
 |-- Description

I would like to add firstName and lastName under Information.Name.

Is there a way to add new columns to multi-level struct types in PySpark?

Expected Schema:

root
 |-- ID
 |-- Information
 |   |-- Name
 |   |   |-- firstName
 |   |   |-- lastName
 |   |-- Age
 |   |-- Gender
 |-- Description
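
For reference, a DataFrame with this structure can be created roughly as below; the column values are just made-up examples:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the current schema
df = spark.createDataFrame(
    [("1", ("Jane Doe", 30, "F"), "some description")],
    "ID string, Information struct<Name string, Age int, Gender string>, Description string",
)
df.printSchema()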

Solution

  • Use withField (available since Spark 3.1); this works:

    from pyspark.sql import functions as F

    df = df.withColumn(
        'Information',
        F.col('Information').withField(
            'Name',
            F.struct(F.col('Information.Name').alias('FName'),
                     F.lit('').alias('LName'))))
    

    Schema Before:

    root
     |-- Id: string (nullable = true)
     |-- Information: struct (nullable = true)
     |    |-- Name: string (nullable = true)
     |    |-- Age: integer (nullable = true)
    

    Schema After:

    root
     |-- Id: string (nullable = true)
     |-- Information: struct (nullable = true)
     |    |-- Name: struct (nullable = false)
     |    |    |-- FName: string (nullable = true)
     |    |    |-- LName: string (nullable = false)
     |    |-- Age: integer (nullable = true)
    

    I initialized FName with the current value of Name; you can use substring if that is needed.
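
    If you want to fill both parts from the existing Name instead of leaving LName empty, a split-based variant works too. Below is a minimal end-to-end sketch, assuming Spark 3.1 or later (withField was added in 3.1), a space-separated full name, and made-up sample rows:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Sample data matching the question's schema
    df = spark.createDataFrame(
        [("1", ("Jane Doe", 30, "F"), "some description")],
        "ID string, Information struct<Name string, Age int, Gender string>, Description string",
    )

    # Split the existing full name on whitespace and use the pieces
    # to build the nested Name struct
    parts = F.split(F.col("Information.Name"), " ")
    df = df.withColumn(
        "Information",
        F.col("Information").withField(
            "Name",
            F.struct(parts.getItem(0).alias("firstName"),
                     parts.getItem(1).alias("lastName"))))

    df.printSchema()  # Information.Name is now struct<firstName, lastName>
    df.select("Information.Name.*").show()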