I have this dataframe:
data = [
{"name": "test", "sentiment": 'positive', "avg": 13.65, "stddev": 15.24},
{"name": "test", "sentiment": 'neutral', "avg": 338.74, "stddev": 187.27},
{"name": "test", "sentiment": 'negative', "avg": 54.58, "stddev": 50.19}
]
df = spark.createDataFrame(data).select("name", "sentiment", "avg", "stddev")
df.show()
+----+---------+------+------+
|name|sentiment| avg|stddev|
+----+---------+------+------+
|test|positive | 13.65| 15.24|
|test|neutral |338.74|187.27|
|test|negative | 54.58| 50.19|
+----+---------+------+------+
I'd like to create a dataframe with this structure:
+----+------------+-----------+------------+------------+-----------+------------+
|name|avg_positive|avg_neutral|avg_negative|std_positive|std_neutral|std_negative|
+----+------------+-----------+------------+------------+-----------+------------+
|test| 13.65| 338.74| 54.58| 15.24| 187.27| 50.19|
+----+------------+-----------+------------+------------+-----------+------------+
I also don't know the name of this operation, feel free to suggest a proper title.
Thanks!
The operation you're describing is called a pivot: it turns the distinct values of a column into new columns. In PySpark you do this with .groupBy() followed by .pivot().
from pyspark.sql import functions as F

df_grp = (
    df
    .groupBy("name")
    .pivot("sentiment")
    .agg(
        F.first("avg").alias("avg"),
        F.first("stddev").alias("stddev"),
    )
)
df_grp.show()
+----+------------+---------------+-----------+--------------+------------+---------------+
|name|negative_avg|negative_stddev|neutral_avg|neutral_stddev|positive_avg|positive_stddev|
+----+------------+---------------+-----------+--------------+------------+---------------+
|test| 54.58| 50.19| 338.74| 187.27| 13.65| 15.24|
+----+------------+---------------+-----------+--------------+------------+---------------+
You can rename the columns afterwards if you want the exact avg_positive / std_positive layout from your question.
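A minimal sketch of such a rename pass, assuming you want "stddev" shortened to "std" and the metric moved in front of the sentiment (the helper name and mapping rule are my own, not part of any Spark API):

```python
def target_name(col):
    """Map a pivoted column like 'positive_avg' to the requested
    layout, e.g. 'avg_positive' or 'std_positive'."""
    sentiment, metric = col.split("_", 1)
    prefix = "std" if metric == "stddev" else metric
    return f"{prefix}_{sentiment}"

pivoted = [
    "positive_avg", "positive_stddev",
    "neutral_avg", "neutral_stddev",
    "negative_avg", "negative_stddev",
]

# Apply the mapping to the pivoted frame from above:
# for old in pivoted:
#     df_grp = df_grp.withColumnRenamed(old, target_name(old))
```

The loop uses withColumnRenamed, which returns a new DataFrame each call, so reassigning df_grp inside the loop is the usual pattern.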