I am trying to pivot a dataframe with one key column and multiple value columns. How do I do this in PySpark? I have used pivot with a single key/value pair before, but I can't figure out this case.
Sample dataframe
id | test_id | test_status | key | score1 | score2 | score3 |
---|---|---|---|---|---|---|
ABC | 1 | complete | q1 | 1 | 2 | 3 |
ABC | 1 | complete | q2 | 4 | 5 | 6 |
ABC | 2 | complete | q1 | 1 | 6 | 7 |
ABC | 2 | complete | q2 | 5 | 6 | 7 |
Expected dataframe
id | test_id | test_status | q1_score1 | q1_score2 | q1_score3 | q2_score1 | q2_score2 | q2_score3 |
---|---|---|---|---|---|---|---|---|
ABC | 1 | complete | 1 | 2 | 3 | 4 | 5 | 6 |
ABC | 2 | complete | 1 | 6 | 7 | 5 | 6 | 7 |
You can pivot with multiple value columns by passing several aggregations to `.agg()`. The pivot value is prefixed to each alias, which produces exactly the `q1_score1` ... `q2_score3` columns you expect.
from pyspark.sql import functions as F

# Pivot on `key`; each alias gets the pivot value prefixed (q1_score1, q1_score2, ...)
df = (df.groupby('id', 'test_id', 'test_status')
        .pivot('key')
        .agg(*[F.first(x).alias(x) for x in ['score1', 'score2', 'score3']]))
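For reference, here is a minimal end-to-end sketch, assuming a local SparkSession and recreating the sample data from the question, so you can verify the resulting column names:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample dataframe from the question
df = spark.createDataFrame(
    [
        ("ABC", 1, "complete", "q1", 1, 2, 3),
        ("ABC", 1, "complete", "q2", 4, 5, 6),
        ("ABC", 2, "complete", "q1", 1, 6, 7),
        ("ABC", 2, "complete", "q2", 5, 6, 7),
    ],
    ["id", "test_id", "test_status", "key", "score1", "score2", "score3"],
)

# Pivot on `key`, aggregating each score column with first()
pivoted = (
    df.groupby("id", "test_id", "test_status")
      .pivot("key")
      .agg(*[F.first(c).alias(c) for c in ["score1", "score2", "score3"]])
)

pivoted.show()
# +---+-------+-----------+---------+---------+---------+---------+---------+---------+
# | id|test_id|test_status|q1_score1|q1_score2|q1_score3|q2_score1|q2_score2|q2_score3|
# +---+-------+-----------+---------+---------+---------+---------+---------+---------+
# |ABC|      1|   complete|        1|        2|        3|        4|        5|        6|
# |ABC|      2|   complete|        1|        6|        7|        5|        6|        7|
# +---+-------+-----------+---------+---------+---------+---------+---------+---------+
# (row order may vary)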