I have business scenario, where I have an id
column and value
columns.
There are multiple value
columns like value_1, value_2 etc.
The idea is to shift values of value
column to left.
The Input is as below:
Input
+------------+-----------+---------+----------+---------+---------+---------+
| id| value_1| value_2| value_3| value_4| value_5| value_6|
+------------+-----------+---------+----------+---------+---------+---------+
| 1| 1011| 1012| 1011| 2018| null| 1011|
| 2| null| 1012| null| 2018| null| 2022|
| 7| 1011| 1011| 1016| 2018| null| null|
| 8| 1011| 1012| 1014| 2018| null| null|
+------------+-----------+---------+----------+---------+---------+---------+
The next step is to replace null values with actual by shifting the columns towards left. The idea is to all the valid valued columns should be kept to left and columns having null
to be shfited right.
The intermedidate output looks as below:
Output1
+------------+-----------+---------+----------+---------+---------+---------+
| id| value_1| value_2| value_3| value_4| value_5| value_6|
+------------+-----------+---------+----------+---------+---------+---------+
| 1| 1011| 1012| 1011| 2018| 1011| null|
| 2| 1012| 2018| 2022| null| null| null|
| 7| 1011| 1011| 1016| 2018| null| null|
| 8| 1011| 1012| 1014| 2018| null| null|
+------------+-----------+---------+----------+---------+---------+---------+
Then, the duplicate values of rows must be removed and if possible the valid values must be shifted further left. The sample output looks as below:
Output2
+------------+-----------+---------+----------+---------+---------+---------+
| id| value_1| value_2| value_3| value_4| value_5| value_6|
+------------+-----------+---------+----------+---------+---------+---------+
| 1| 1011| 1012| 2018| null| null| null|
| 2| 1012| 2018| 2022| null| null| null|
| 7| 1011| 1016| 2018| null| null| null|
| 8| 1011| 1012| 1014| 2018| null| null|
+------------+-----------+---------+----------+---------+---------+---------+
Any leads on this would be helpful. Thanks.
cols = ['value_1', 'value_2', 'value_3',
'value_4', 'value_5', 'value_6']
arr = F.array_compact(F.array_distinct(F.array(*cols)))
df1 = df.select('id', *[arr[i].alias(c) for i, c in enumerate(cols)])
Create an array from the value columns then remove the duplicate values using array_distinct
then use array_compact
to drop the null values. Finally use a select expression to yank the cols from the resulting array and assign them back to the dataframe.
df1.show()
+---+-------+-------+-------+-------+-------+-------+
|id |value_1|value_2|value_3|value_4|value_5|value_6|
+---+-------+-------+-------+-------+-------+-------+
|1 |1011 |1012 |2018 |null |null |null |
|2 |1012 |2018 |2022 |null |null |null |
|7 |1011 |1016 |2018 |null |null |null |
|8 |1011 |1012 |1014 |2018 |null |null |
+---+-------+-------+-------+-------+-------+-------+