Tags: python, apache-spark, pyspark

Is there a way to expand an array like a struct in PySpark? Star does not work


I have a DataFrame with the following nesting in PySpark:

content: struct
    importantId: string
    data: array
        element: struct
            importantCol0: string
            importantCol1: string

I need the following output:

importantId  importantCol0  importantCol1
10800005     0397AZ         0397AZ
10800006     0397BZ         0397BZ
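
For reproducibility, a minimal frame with this shape can be built like this (a sketch; the constructor call and sample values are assumptions based on the expected output above):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# a single 'content' struct column: a string plus an array of structs
df0 = spark.createDataFrame(
    [
        (('10800005', [('0397AZ', '0397AZ')]),),
        (('10800006', [('0397BZ', '0397BZ')]),),
    ],
    'content struct<importantId: string, '
    'data: array<struct<importantCol0: string, importantCol1: string>>>',
)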

I tried the following code:

df1 = df0.select(F.col('content.*'))

I got:

importantId  data
10800005     {importantCol0: 0397AZ, importantCol1: 0397AZ}
10800006     {importantCol0: 0397BZ, importantCol1: 0397BZ}

I followed with:

df2 = df1.select(F.col('importantId'), F.col('data.*'))

but I get the following error:

AnalysisException: Can only star expand struct data types. Attribute: ArrayBuffer(data).

Does anyone know how to fix this? I was expecting a way to expand an array the same way as a struct.


Solution

  • Use inline to explode the array of structs into rows and columns; star expansion (.*) works only on struct columns, which is why the array has to be turned into rows first:

    result = df.select('*', F.inline('data')).drop('data')
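
    F.inline is available in the Python API from Spark 3.4; on older
    versions the same SQL generator function can be reached through
    expr (a sketch, assuming the same column name):

    # equivalent for PySpark < 3.4, where F.inline is not exposed
    result = df.select('*', F.expr('inline(data)')).drop('data')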
    

    Example:

    df.show()
    
    +------------+--------------------+
    |importantCol|                data|
    +------------+--------------------+
    |           1|    [{1, 2}, {4, 3}]|
    |           2|[{10, 20}, {40, 30}]|
    +------------+--------------------+
    
    result.show()
    
    +------------+-------------+-------------+
    |importantCol|importantCol0|importantCol1|
    +------------+-------------+-------------+
    |           1|            1|            2|
    |           1|            4|            3|
    |           2|           10|           20|
    |           2|           40|           30|
    +------------+-------------+-------------+
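
    Applied to the nested schema from the question, the two steps chain
    together: star-expand the outer struct first, then inline the array
    (a sketch; df0 is the original frame from the question):

    result = (
        df0
        .select('content.*')                      # importantId, data
        .select('importantId', F.inline('data'))  # one row per array element
    )
    result.show()

    +-----------+-------------+-------------+
    |importantId|importantCol0|importantCol1|
    +-----------+-------------+-------------+
    |   10800005|       0397AZ|       0397AZ|
    |   10800006|       0397BZ|       0397BZ|
    +-----------+-------------+-------------+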