I have the following dataframe.
The values
column contains multiple time string values and I want to sort these values as ascending manner as shown in a after part below.
Before
id values
1 2023-04-02, 2023-02-01, 2023-01-04
2 2023-02-01, 2023-01-04, 2023-02-07
3 2023-01-08, 2023-01-05, 2023-01-02
After
id values
1 2023-01-04, 2023-02-01, 2023-04-02
2 2023-01-04, 2023-02-01, 2023-02-07
3 2023-01-02, 2023-01-05, 2023-01-08
I am using the pyspark and how to achieve it?
Thanks.
Your DataFrame (df_1):
+---+----------------------------------+
|id |value |
+---+----------------------------------+
|1 |2023-04-02, 2023-02-01, 2023-01-04|
|2 |2023-02-01, 2023-01-04, 2023-02-07|
|3 |2023-01-08, 2023-01-05, 2023-01-02|
+---+----------------------------------+
Import necessary functions
from pyspark.sql.functions import split, array_sort, concat_ws
split()
to separate string column value based on delimiterdf_2 = df_1.withColumn("value", split("value", ", "))
array_sort()
to sort your elementdf_3 = df_2.withColumn("value", array_sort("value"))
concat_ws()
to join array element into string type with a value delimitordf_4 = df_3.withColumn("value", concat_ws(", ", "value"))
df_4.show(truncate=False)
Output:
+---+----------------------------------+
|id |value |
+---+----------------------------------+
|1 |2023-01-04, 2023-02-01, 2023-04-02|
|2 |2023-01-04, 2023-02-01, 2023-02-07|
|3 |2023-01-02, 2023-01-05, 2023-01-08|
+---+----------------------------------+