I'm trying to merge the data from a dataset as follow:
id | sms | longDescription | OtherFields |
---|---|---|---|
123 | contentSms | ContentDesc | xxx |
123 | contentSms2 | ContentDesc2 | xxx |
123 | contentSms3 | ContentDesc3 | xxx |
456 | contentSms4 | ContentDesc | xxx |
the sms and longDescription have the following structure:
sms:array
|----element:struct
|----content:string
|----languageId:string
The aim is to capture the data with the same Id and merge the column sms
and longDescription
into one array with multiple struct( with the languageID
as key):
id | sms | longDescription | OtherFields |
---|---|---|---|
123 | contentSms, ContentSms2,contentSms3 | ContentDesc,ContentDesc2,ContentDesc3 | xxx |
456 | contentSms4 | ContentDesc | xxx |
I've tried using
x = df.select("*").groupBy("id").agg( collect_list("sms"))
but the result is :
collect_list(longDescription): array (nullable = false)
| |-- element: array (containsNull = false)
| | |-- element: struct (containsNull = true)
| | | |-- content: string (nullable = true)
| | | |-- languageId: string (nullable = true)
which is an array too much, as the goal is to have an array of struct in order to have the following result:
sms: [{content: 'aze', languageId:'en-GB'},{content: 'rty', languageId:'fr-BE'},{content: 'poiu', languageId:'nl-BE'}]
You're looking for flatten
function:
x = df.groupBy("id").agg(flatten(collect_list("sms")))