python, apache-spark, pyspark

Shifting column values to the left based on the values in them


I have a business scenario with an id column and multiple value columns, such as value_1, value_2, etc.

The idea is to shift the values in the value columns to the left.

The Input is as below:

Input
+------------+-----------+---------+----------+---------+---------+---------+
|          id|    value_1|  value_2|   value_3|  value_4|  value_5|  value_6|
+------------+-----------+---------+----------+---------+---------+---------+
|           1|       1011|     1012|      1011|     2018|     null|     1011|
|           2|       null|     1012|      null|     2018|     null|     2022|
|           7|       1011|     1011|      1016|     2018|     null|     null|
|           8|       1011|     1012|      1014|     2018|     null|     null|
+------------+-----------+---------+----------+---------+---------+---------+
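
As a minimal sketch, assuming a local SparkSession named spark, the input above can be recreated like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Explicit schema, since value_5 is null in every sample row and cannot be inferred
schema = 'id long, value_1 long, value_2 long, value_3 long, value_4 long, value_5 long, value_6 long'
data = [
    (1, 1011, 1012, 1011, 2018, None, 1011),
    (2, None, 1012, None, 2018, None, 2022),
    (7, 1011, 1011, 1016, 2018, None, None),
    (8, 1011, 1012, 1014, 2018, None, None),
]
df = spark.createDataFrame(data, schema)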

The next step is to shift the values to the left so that the nulls are pushed to the right: every valid value should end up in the leftmost columns, and the columns holding null should be shifted to the right. The intermediate output looks as below:

Output1
+------------+-----------+---------+----------+---------+---------+---------+
|          id|    value_1|  value_2|   value_3|  value_4|  value_5|  value_6|
+------------+-----------+---------+----------+---------+---------+---------+
|           1|       1011|     1012|      1011|     2018|     1011|     null|
|           2|       1012|     2018|      2022|     null|     null|     null|
|           7|       1011|     1011|      1016|     2018|     null|     null|
|           8|       1011|     1012|      1014|     2018|     null|     null|
+------------+-----------+---------+----------+---------+---------+---------+

Then, duplicate values within each row must be removed and, where possible, the remaining values shifted further to the left. The sample output looks as below:

Output2
+------------+-----------+---------+----------+---------+---------+---------+
|          id|    value_1|  value_2|   value_3|  value_4|  value_5|  value_6|
+------------+-----------+---------+----------+---------+---------+---------+
|           1|       1011|     1012|      2018|     null|     null|     null|
|           2|       1012|     2018|      2022|     null|     null|     null|
|           7|       1011|     1016|      2018|     null|     null|     null|
|           8|       1011|     1012|      1014|     2018|     null|     null|
+------------+-----------+---------+----------+---------+---------+---------+

Any leads on this would be helpful. Thanks.


Solution

  • Code

    from pyspark.sql import functions as F

    cols = ['value_1', 'value_2', 'value_3',
            'value_4', 'value_5', 'value_6']

    # Deduplicate the values and drop the nulls, keeping one array per row
    arr = F.array_compact(F.array_distinct(F.array(*cols)))

    # Unpack the array back into the original column names; indexing past the end yields null
    df1 = df.select('id', *[arr[i].alias(c) for i, c in enumerate(cols)])
    

    How this works

    Create an array from the value columns, remove the duplicate values with array_distinct, and then drop the null values with array_compact. Finally, use a select expression to pull the elements out of the resulting array and assign them back to the original column names; positions beyond the end of the array come back as null, which is what fills the trailing columns.
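
    To see each stage on its own, a small sketch along these lines (the column names raw, deduped and compacted are only illustrative) keeps every intermediate result as a separate array column:

    steps = df.select(
        'id',
        F.array(*cols).alias('raw'),                         # all value columns packed into one array
        F.array_distinct(F.array(*cols)).alias('deduped'),   # duplicates removed; a null may remain
        arr.alias('compacted'),                              # duplicates removed and nulls dropped
    )
    steps.show(truncate=False)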

    Result

    df1.show()
    
    +---+-------+-------+-------+-------+-------+-------+
    |id |value_1|value_2|value_3|value_4|value_5|value_6|
    +---+-------+-------+-------+-------+-------+-------+
    |1  |1011   |1012   |2018   |null   |null   |null   |
    |2  |1012   |2018   |2022   |null   |null   |null   |
    |7  |1011   |1016   |2018   |null   |null   |null   |
    |8  |1011   |1012   |1014   |2018   |null   |null   |
    +---+-------+-------+-------+-------+-------+-------+
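
    One caveat: array_compact is only available from Spark 3.4 onwards. On earlier versions, a rough equivalent (a sketch, not tested against older releases) is to drop the nulls with the higher-order filter function, available since Spark 3.1:

    # Assumed alternative for Spark < 3.4: filter out nulls instead of array_compact
    arr = F.filter(F.array_distinct(F.array(*cols)), lambda x: x.isNotNull())
    df1 = df.select('id', *[arr[i].alias(c) for i, c in enumerate(cols)])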