I have the following table
Team | Team_Category | Score1 | Score2 | Score3 | Score4 | Score5 | Score6 | Score7 | Score8 |
---|---|---|---|---|---|---|---|---|---|
A41 | Bikes | 40 | 91 | 51 | |||||
A57 | Bicycles | 60 | 47 |
I want to left shift the values if a previous score is NULL
or ""
The required output is:
Team | Team_Category | Score1 | Score2 | Score3 | Score4 | Score5 | Score6 | Score7 | Score8 |
---|---|---|---|---|---|---|---|---|---|
A41 | Bikes | 40 | 91 | 51 | |||||
A57 | Bicycles | 60 | 47 |
I tried using coalesce
like
spark.sql("""SELECT Team, Team_Category, COALESCE(NULLIF(Score1,""),NULLIF(Score2,""),...NULLIF(Score8,"")) AS Score1, COALESCE(NULLIF(Score2,""),NULLIF(Score3,"")...NULLIF(Score8,"")) AS Score2,...""")
but this would only solve half the problem. Even though the values will now be moved to the left, the values on the right will still be there.
How can this be done using spark sql?
Note: A team can have multiple rows, but they are independent from each other.
Create an array from the Score
columns then use array_compact
to remove nulls finally based on the new position of each element in the array reassign the scores columns back to the dataframe
cols = [c for c in df.columns if c.startswith('Score')]
arr = F.array_compact(F.array(*cols))
df1 = df.select('Team', 'Team_Category', *[arr[i].alias(c) for i, c in enumerate(cols)])
+----+-------------+------+------+------+------+------+------+------+------+
|Team|Team_Category|Score1|Score2|Score3|Score4|Score5|Score6|Score7|Score8|
+----+-------------+------+------+------+------+------+------+------+------+
| A41| Bikes| 40| 91| 51| null| null| null| null| null|
| A57| Bicycles| 60| 47| null| null| null| null| null| null|
+----+-------------+------+------+------+------+------+------+------+------+
Note: array_compact
is only available starting from spark 3.4.0 if you are using an old version then we can use filter
insstead: arr = F.filter(F.array(*cols), lambda x: x.isNotNull())