arrays apache-spark pyspark apache-spark-sql aggregation

Merging column with array from multiple rows

I'm trying to merge the data from a dataset as follow:

id	sms	longDescription	OtherFields
123	contentSms	ContentDesc	xxx
123	contentSms2	ContentDesc2	xxx
123	contentSms3	ContentDesc3	xxx
456	contentSms4	ContentDesc	xxx

the sms and longDescription have the following structure:

sms:array
|----element:struct
      |----content:string
      |----languageId:string

The aim is to capture the data with the same Id and merge the column sms and longDescription into one array with multiple struct( with the languageID as key):

id	sms	longDescription	OtherFields
123	contentSms, ContentSms2,contentSms3	ContentDesc,ContentDesc2,ContentDesc3	xxx
456	contentSms4	ContentDesc	xxx

I've tried using

x = df.select("*").groupBy("id").agg( collect_list("sms"))

but the result is :

collect_list(longDescription): array (nullable = false)
 |    |-- element: array (containsNull = false)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- content: string (nullable = true)
 |    |    |    |-- languageId: string (nullable = true)

which is an array too much, as the goal is to have an array of struct in order to have the following result:

sms: [{content: 'aze', languageId:'en-GB'},{content: 'rty', languageId:'fr-BE'},{content: 'poiu', languageId:'nl-BE'}]

Solution

You're looking for flatten function:

x = df.groupBy("id").agg(flatten(collect_list("sms")))