Search code examples
pythonpandaspysparkdatabricks

Transposing or pivoting a dataframe in databricks pyspark?


Appreciate hints on how to go about pivoting/transposing a dataframe in such a way that I will have the transposed Cx name column as a concatenation of the Cx names separated by a comma. enter image description here

I have found the advice provided here: https://stackoverflow.com/questions/71474821/how-i-can-transpose-a-data-frame-in-pyspark , very useful, however just need some pointers now on next step to have the final results look like this: enter image description here

Code that I used, credits to: https://stackoverflow.com/users/6336479/david-%d7%93%d7%95%d7%93%d7%95-markovitz is shown below:

import pyspark.pandas as ps
from pyspark.sql.functions import desc, asc
ps.get_option("compute.max_rows") # 1000
ps.set_option("compute.max_rows", 2000)
df = spark.createDataFrame([('12345','John Doe',1),('12345','Jane Doe',2)],['Acct' , 'CxName' ,'Position' ])
df = df.sort(asc('position'))
df.display()
(df
#  .to_pandas_on_spark()
 .pandas_api()
 .set_index('Position')
 .T
 .reset_index()
 .rename(columns={"index":"Position"})
 .to_spark()
 .show())

Solution

  • Code

    df1 = df.groupBy('Acct').agg(F.expr("collect_list(struct(Position, CxName)) AS CxName"))
    df1 = df1.withColumn('CxName', F.expr("transform(array_sort(CxName), x -> x['CxName'])"))
    df1 = df1.withColumn('CxName', F.expr("array_join(CxName, ',')"))
    

    How this works

    Group the dataframe by Acct and collect the pairs of Position, CxName

    +-----+------------------------------+
    |Acct |CxName                        |
    +-----+------------------------------+
    |12345|[{1, John Doe}, {2, Jane Doe}]|
    +-----+------------------------------+
    

    Sort the pairs inside array by their position and transform the sorted pairs to extract the value of CxName

    +-----+--------------------+
    |Acct |CxName              |
    +-----+--------------------+
    |12345|[John Doe, Jane Doe]|
    +-----+--------------------+
    

    Join the values in the array around the delimiter ','

    +-----+-----------------+
    |Acct |CxName           |
    +-----+-----------------+
    |12345|John Doe,Jane Doe|
    +-----+-----------------+