Search code examples
arraysapache-sparkpysparkapache-spark-sqlaggregation

Merging column with array from multiple rows


I'm trying to merge the data from a dataset as follow:

id sms longDescription OtherFields
123 contentSms ContentDesc xxx
123 contentSms2 ContentDesc2 xxx
123 contentSms3 ContentDesc3 xxx
456 contentSms4 ContentDesc xxx

the sms and longDescription have the following structure:

sms:array
|----element:struct
      |----content:string
      |----languageId:string

The aim is to capture the data with the same Id and merge the column sms and longDescription into one array with multiple struct( with the languageID as key):

id sms longDescription OtherFields
123 contentSms, ContentSms2,contentSms3 ContentDesc,ContentDesc2,ContentDesc3 xxx
456 contentSms4 ContentDesc xxx

I've tried using

x = df.select("*").groupBy("id").agg( collect_list("sms"))

but the result is :

collect_list(longDescription): array (nullable = false)
 |    |-- element: array (containsNull = false)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- content: string (nullable = true)
 |    |    |    |-- languageId: string (nullable = true)

which is an array too much, as the goal is to have an array of struct in order to have the following result:

sms: [{content: 'aze', languageId:'en-GB'},{content: 'rty', languageId:'fr-BE'},{content: 'poiu', languageId:'nl-BE'}]

Solution

  • You're looking for flatten function:

    x = df.groupBy("id").agg(flatten(collect_list("sms")))