Tags: python, pyspark

Pyspark - from long to wide with new column names


I have this dataframe:

data = [
    {"name": "test", "sentiment": 'positive', "avg": 13.65, "stddev": 15.24},
    {"name": "test", "sentiment": 'neutral', "avg": 338.74, "stddev": 187.27},
    {"name": "test", "sentiment": 'negative', "avg": 54.58, "stddev": 50.19}
]

df = spark.createDataFrame(data).select("name", "sentiment", "avg", "stddev")
df.show()
+----+---------+------+------+
|name|sentiment|   avg|stddev|
+----+---------+------+------+
|test| positive| 13.65| 15.24|
|test|  neutral|338.74|187.27|
|test| negative| 54.58| 50.19|
+----+---------+------+------+

I'd like to create a dataframe with this structure:

+----+------------+-----------+------------+------------+-----------+------------+
|name|avg_positive|avg_neutral|avg_negative|std_positive|std_neutral|std_negative|
+----+------------+-----------+------------+------------+-----------+------------+
|test|       13.65|     338.74|       54.58|       15.24|     187.27|       50.19|
+----+------------+-----------+------------+------------+-----------+------------+

I also don't know the name of this operation; feel free to suggest a proper title.

Thanks!


Solution

  • This long-to-wide reshape is called a pivot. Use .groupBy() on the columns you want to keep fixed and .pivot() on the column whose values should become new columns:

    from pyspark.sql import functions as F

    df_grp = (
        df
        .groupBy("name")
        .pivot("sentiment")
        .agg(
            # each (name, sentiment) pair holds exactly one row,
            # so F.first() simply picks up that row's values
            F.first("avg").alias("avg"),
            F.first("stddev").alias("stddev"),
        )
    )
    df_grp.show()
    
    +----+------------+---------------+-----------+--------------+------------+---------------+
    |name|negative_avg|negative_stddev|neutral_avg|neutral_stddev|positive_avg|positive_stddev|
    +----+------------+---------------+-----------+--------------+------------+---------------+
    |test|       54.58|          50.19|     338.74|        187.27|       13.65|          15.24|
    +----+------------+---------------+-----------+--------------+------------+---------------+
    

    The pivoted names come out as {sentiment}_{metric} rather than the requested {metric}_{sentiment}, so rename the columns if you want to match your target layout exactly, as in the sketch below.
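    A minimal renaming sketch, assuming the three sentiment values are fixed and known up front (sentiments and df_wide are just illustrative names):

    sentiments = ["positive", "neutral", "negative"]

    # flip the "{sentiment}_{metric}" pattern around and shorten
    # "stddev" to "std" to match the layout in the question
    df_wide = df_grp.select(
        "name",
        *[F.col(f"{s}_avg").alias(f"avg_{s}") for s in sentiments],
        *[F.col(f"{s}_stddev").alias(f"std_{s}") for s in sentiments],
    )
    df_wide.show()

    This yields exactly the one-row layout requested in the question.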