I am currently working with the following DataFrame:
A | B | C | D | E |
---|---|---|---|---|
1 | 2 | some | null | something A |
1 | 2 | some | something B | null |
And I need the following output:
A | B | C | D | E |
---|---|---|---|---|
1 | 2 | some | something B | something A |
My problem is that I can't make this work with a groupBy on the string columns.
I tried a self-join and a pivot.
What about something like this?
from pyspark.sql import functions as F

cols_for_groupby = ["A", "B", "C"]

(
    df
    .groupby(cols_for_groupby)
    # max ignores nulls, so each remaining column collapses to its
    # single non-null value within the group
    .agg(*[
        F.max(c).alias(c)
        for c in df.columns if c not in cols_for_groupby
    ])
)
If df is the DataFrame in the question, the result is:
+---+---+----+-----------+-----------+
| A| B| C| D| E|
+---+---+----+-----------+-----------+
| 1| 2|some|something B|something A|
+---+---+----+-----------+-----------+