Search code examples
pysparkapache-spark-sqldatabricks-sql

How to left shift column value in spark sql?


I have the following table

Team Team_Category Score1 Score2 Score3 Score4 Score5 Score6 Score7 Score8
A41 Bikes 40 91 51
A57 Bicycles 60 47

I want to left shift the values if a previous score is NULL or ""
The required output is:

Team Team_Category Score1 Score2 Score3 Score4 Score5 Score6 Score7 Score8
A41 Bikes 40 91 51
A57 Bicycles 60 47

I tried using coalesce like

spark.sql("""SELECT Team, Team_Category, COALESCE(NULLIF(Score1,""),NULLIF(Score2,""),...NULLIF(Score8,"")) AS Score1, COALESCE(NULLIF(Score2,""),NULLIF(Score3,"")...NULLIF(Score8,"")) AS Score2,...""")

but this would only solve half the problem. Even though the values will now be moved to the left, the values on the right will still be there.

How can this be done using spark sql?

Note: A team can have multiple rows, but they are independent from each other.


Solution

  • Create an array from the Score columns then use array_compact to remove nulls finally based on the new position of each element in the array reassign the scores columns back to the dataframe

    cols = [c for c in df.columns if c.startswith('Score')]
    
    arr = F.array_compact(F.array(*cols))
    df1 = df.select('Team', 'Team_Category', *[arr[i].alias(c) for i, c in enumerate(cols)])
    

    +----+-------------+------+------+------+------+------+------+------+------+
    |Team|Team_Category|Score1|Score2|Score3|Score4|Score5|Score6|Score7|Score8|
    +----+-------------+------+------+------+------+------+------+------+------+
    | A41|        Bikes|    40|    91|    51|  null|  null|  null|  null|  null|
    | A57|     Bicycles|    60|    47|  null|  null|  null|  null|  null|  null|
    +----+-------------+------+------+------+------+------+------+------+------+
    

    Note: array_compact is only available starting from spark 3.4.0 if you are using an old version then we can use filter insstead: arr = F.filter(F.array(*cols), lambda x: x.isNotNull())