
How to sort comma-delimited values using PySpark?


I have the following DataFrame. The values column contains multiple date strings, and I want to sort these values in ascending order, as shown in the After section below.

Before

id    values
1    2023-04-02, 2023-02-01, 2023-01-04
2    2023-02-01, 2023-01-04, 2023-02-07
3    2023-01-08, 2023-01-05, 2023-01-02

After

id    values
1    2023-01-04, 2023-02-01, 2023-04-02
2    2023-01-04, 2023-02-01, 2023-02-07
3    2023-01-02, 2023-01-05, 2023-01-08

I am using PySpark; how can I achieve this?

Thanks.


Solution

  • Your DataFrame (df_1):

    +---+----------------------------------+
    |id |value                             |
    +---+----------------------------------+
    |1  |2023-04-02, 2023-02-01, 2023-01-04|
    |2  |2023-02-01, 2023-01-04, 2023-02-07|
    |3  |2023-01-08, 2023-01-05, 2023-01-02|
    +---+----------------------------------+
    

    Import necessary functions

    from pyspark.sql.functions import split, array_sort, concat_ws
    
    # 1. Use split() to turn the string column "value" into an array, based on the delimiter
    df_2 = df_1.withColumn("value", split("value", ", "))
    
    # 2. Use array_sort() to sort the array elements in ascending order
    df_3 = df_2.withColumn("value", array_sort("value"))
    
    # 3. Use concat_ws() to join the array elements back into one string with the same delimiter
    df_4 = df_3.withColumn("value", concat_ws(", ", "value"))
    
    df_4.show(truncate=False)
    

    Output:

    +---+----------------------------------+
    |id |value                             |
    +---+----------------------------------+
    |1  |2023-01-04, 2023-02-01, 2023-04-02|
    |2  |2023-01-04, 2023-02-01, 2023-02-07|
    |3  |2023-01-02, 2023-01-05, 2023-01-08|
    +---+----------------------------------+