I would like to concatenate two columns, but in a way that the values are ordered. For example, given a dataframe like this:
| column_1 | column_2 |
|----------|----------|
| aaa      | bbb      |
| bbb      | aaa      |
I want to get a dataframe like this:
| column_1 | column_2 | concated_cols |
|----------|----------|---------------|
| aaa      | bbb      | aaabbb        |
| bbb      | aaa      | aaabbb        |
Spark >= 2.4: sort the values with array_sort before joining them, so the result is the same regardless of column order.

from pyspark.sql import functions as F

df.withColumn(
    "concated_cols",
    F.array_join(F.array_sort(F.array(F.col("column_1"), F.col("column_2"))), ""),
).show()
Spark <= 2.3: use a simple UDF that sorts the values before concatenating.

from pyspark.sql import functions as F

@F.udf
def concat(*cols):
    return "".join(sorted(cols))

df.withColumn("concated_cols", concat(F.col("column_1"), F.col("column_2"))).show()
+--------+--------+-------------+
|column_1|column_2|concated_cols|
+--------+--------+-------------+
| aaa| bbb| aaabbb|
| bbb| aaa| aaabbb|
+--------+--------+-------------+
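Both versions rely on the same idea: sort the values lexicographically before joining, so the result does not depend on which column holds which value. A minimal plain-Python sketch of that logic (the name `ordered_concat` is just for illustration; it mirrors what the UDF above does per row):

```python
def ordered_concat(*cols):
    # Sort the string values lexicographically, then concatenate,
    # so ("aaa", "bbb") and ("bbb", "aaa") produce the same output.
    return "".join(sorted(cols))

print(ordered_concat("aaa", "bbb"))  # aaabbb
print(ordered_concat("bbb", "aaa"))  # aaabbb
```

The same sketch works for more than two columns, since `sorted` accepts any number of values.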