I'd appreciate hints on how to go about pivoting/transposing a dataframe so that the transposed CxName column ends up as a concatenation of the Cx names, separated by commas.
I have found the advice provided here: https://stackoverflow.com/questions/71474821/how-i-can-transpose-a-data-frame-in-pyspark very useful; however, I now just need some pointers on the next step, so that the final result has one row per Acct with the CxName values comma-joined (as in the last table below).
The code that I used (credits to https://stackoverflow.com/users/6336479/david-%d7%93%d7%95%d7%93%d7%95-markovitz) is shown below:
import pyspark.pandas as ps
from pyspark.sql.functions import asc

ps.get_option("compute.max_rows")  # 1000
ps.set_option("compute.max_rows", 2000)

df = spark.createDataFrame([('12345', 'John Doe', 1), ('12345', 'Jane Doe', 2)],
                           ['Acct', 'CxName', 'Position'])
df = df.sort(asc('Position'))
df.display()  # Databricks display

(df
 .pandas_api()              # .to_pandas_on_spark() is the older name for the same call
 .set_index('Position')
 .T                         # transpose: Position values become columns
 .reset_index()
 .rename(columns={"index": "Position"})
 .to_spark()
 .show())
from pyspark.sql import functions as F  # needed for F.expr

df1 = df.groupBy('Acct').agg(F.expr("collect_list(struct(Position, CxName)) AS CxName"))
df1 = df1.withColumn('CxName', F.expr("transform(array_sort(CxName), x -> x['CxName'])"))
df1 = df1.withColumn('CxName', F.expr("array_join(CxName, ',')"))
First, group the dataframe by Acct and collect the (Position, CxName) pairs into an array of structs. Position is kept because collect_list does not guarantee any ordering, so it is needed to sort the names later:
+-----+------------------------------+
|Acct |CxName |
+-----+------------------------------+
|12345|[{1, John Doe}, {2, Jane Doe}]|
+-----+------------------------------+
Sort the pairs inside the array by their Position, then transform the sorted array to keep only the CxName values (a note on why array_sort orders by Position follows the table):
+-----+--------------------+
|Acct |CxName |
+-----+--------------------+
|12345|[John Doe, Jane Doe]|
+-----+--------------------+
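Note that array_sort compares the structs field by field in declaration order, so having Position as the first field of struct(Position, CxName) is exactly what makes the sort behave like an order-by-Position. A minimal standalone check (just a sketch, reusing the example values from above):

spark.sql("""
    SELECT array_sort(array(struct(2 AS Position, 'Jane Doe' AS CxName),
                            struct(1 AS Position, 'John Doe' AS CxName))) AS sorted
""").show(truncate=False)
# expected: [{1, John Doe}, {2, Jane Doe}]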
Finally, join the values in the array with ',' as the delimiter:
+-----+-----------------+
|Acct |CxName |
+-----+-----------------+
|12345|John Doe,Jane Doe|
+-----+-----------------+
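For reference, the same result can be built with the DataFrame function API instead of SQL expression strings. This is only a sketch, assuming Spark 3.1+ (for the Python-lambda form of transform); on older versions the F.expr version above is the way to go:

from pyspark.sql import functions as F

df2 = (
    df.groupBy('Acct')
      .agg(F.array_sort(F.collect_list(F.struct('Position', 'CxName'))).alias('pairs'))
      # keep only the names (already in Position order) and join them with a comma
      .withColumn('CxName', F.concat_ws(',', F.transform(F.col('pairs'), lambda x: x['CxName'])))
      .drop('pairs')
)
df2.show(truncate=False)

# +-----+-----------------+
# |Acct |CxName           |
# +-----+-----------------+
# |12345|John Doe,Jane Doe|
# +-----+-----------------+

Here concat_ws plays the same role as array_join in the expression-based version; for an array of non-null strings the two are interchangeable.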