PySpark concat two columns in order


I would like to concatenate two columns so that their values are sorted before being joined. For example, I have a dataframe like this:

|-------------------|-----------------|
|      column_1     |     column_2    |
|-------------------|-----------------|
|          aaa      |        bbb      |
|-------------------|-----------------|
|          bbb      |        aaa      |
|-------------------|-----------------|

I would like to get a dataframe like this:

|-------------------|-----------------|-----------------|
|      column_1     |     column_2    |  concated_cols  |
|-------------------|-----------------|-----------------|
|         aaa       |       bbb       |    aaabbb       |
|-------------------|-----------------|-----------------|
|         bbb       |       aaa       |    aaabbb       |
|-------------------|-----------------|-----------------|


Solution

  • Spark >= 2.4, using the built-in array functions:

    from pyspark.sql import functions as F
    
    # Build an array from both columns, sort it, then join the elements with no separator.
    df.withColumn(
        "concated_cols",
        F.array_join(F.array_sort(F.array(F.col("column_1"), F.col("column_2"))), ""),
    ).show()
    

  • Spark <= 2.3, with a simple UDF:

    from pyspark.sql import functions as F
    
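    # F.udf without an explicit return type produces a string column.
    @F.udf
    def concat(*cols):
        # Sort the row's values, then join them with no separator.
        # On Python 3, sorted() raises TypeError if one of the values is None.
        return "".join(sorted(cols))


    df.withColumn("concated_cols", concat(F.col("column_1"), F.col("column_2"))).show()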
    +--------+--------+-------------+
    |column_1|column_2|concated_cols|
    +--------+--------+-------------+
    |     aaa|     bbb|       aaabbb|
    |     bbb|     aaa|       aaabbb|
    +--------+--------+-------------+
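
Both variants apply the same per-row logic: sort the values, then join them with an empty separator. A minimal sketch of that logic in plain Python, which can be checked without a Spark session, might look like the following. The helper name `concat_sorted` is hypothetical, and skipping `None` is an assumption chosen to mirror how `array_join` ignores nulls when no replacement is given.

```python
def concat_sorted(*values):
    """Sort the non-null values and join them with no separator."""
    # Assumption: drop None so a null column does not break the sort.
    present = [v for v in values if v is not None]
    return "".join(sorted(present))


# The two rows from the question, in (column_1, column_2) order.
rows = [("aaa", "bbb"), ("bbb", "aaa")]

# Append the ordered concatenation as a third field on each row.
result = [(c1, c2, concat_sorted(c1, c2)) for c1, c2 in rows]

for row in result:
    print(row)
```

The same null-safe filter could be used inside the UDF body if the columns may contain nulls.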