pyspark, concatenation

How to concatenate two columns in PySpark so that the result does not depend on the order of the values in each row?


I have a PySpark data frame like the sample below. The name column does not exist yet; it is what I need to create from the columns side 1 and side 2. The issue is that the name needs to be the same regardless of whether side 1 contains A and side 2 contains B, or side 1 contains B and side 2 contains A; the final result needs to be A/B in both cases. Simply concatenating the two columns does not work, because the same pair would produce two different names (A/B vs. B/A). I need to do a groupBy further down the line using the name column, which is why the names need to come out consistently. Could someone please point me to a way to do this?

import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType

df = pd.DataFrame({"column1": ['Value1','Value2','Value3','Value4','Value5','Value6','Value7','Value8','Value9','Value10','Value11','Value12','Value13','Value14','Value 15','Value16'],
                   "column2": ['Value1','Value2','Value3','Value4','Value5','Value6','Value7','Value8','Value9','Value10','Value11','Value12','Value13','Value14','Value 15','Value16'],
                   "column3": ['Value1','Value2','Value3','Value4','Value5','Value6','Value7','Value8','Value9','Value10','Value11','Value12','Value13','Value14','Value 15','Value16'],
                   "column4": ['Value1','Value2','Value3','Value4','Value5','Value6','Value7','Value8','Value9','Value10','Value11','Value12','Value13','Value14','Value 15','Value16'],
                   "side1": ['A','B','A','K','C','G','F','F','M','N','S','K','K','N','RE','RA'],
                   "side2": ['B','A','B','M','H','H','Y','Y','N','M','F','N','N','K','RA','RE']})

mySchema = StructType([
    StructField("column 1", StringType(), True),
    StructField("column 2", StringType(), True),
    StructField("column 3", StringType(), True),
    StructField("column 4", StringType(), True),
    StructField("side 1", StringType(), True),
    StructField("side 2", StringType(), True)])

# The schema is applied positionally, so its names (with spaces) replace the pandas column names
df = spark.createDataFrame(df, schema=mySchema)



Solution

  • You can use sort_array to sort the array of the side 1 and side 2 values alphabetically, then concatenate the sorted values with concat_ws; that way both (A, B) and (B, A) produce A/B.

    from pyspark.sql.functions import array, concat_ws, sort_array

    # Sort the two side values alphabetically before joining, so A/B and B/A both become A/B
    df.withColumn("name", concat_ws("/", sort_array(array("side 1", "side 2")))).show()
    
    +--------+--------+--------+--------+------+------+-----+                       
    |column 1|column 2|column 3|column 4|side 1|side 2| name|
    +--------+--------+--------+--------+------+------+-----+
    | Value 1| Value 1| Value 1| Value 1|     A|     B|  A/B|
    | Value 2| Value 2| Value 2| Value 2|     B|     A|  A/B|
    | Value 3| Value 3| Value 3| Value 3|     A|     B|  A/B|
    | Value 4| Value 4| Value 4| Value 4|     K|     M|  K/M|
    | Value 5| Value 5| Value 5| Value 5|     C|     H|  C/H|
    | Value 6| Value 6| Value 6| Value 6|     G|     H|  G/H|
    | Value 7| Value 7| Value 7| Value 7|     F|     Y|  F/Y|
    | Value 8| Value 8| Value 8| Value 8|     F|     Y|  F/Y|
    | Value 9| Value 9| Value 9| Value 9|     M|     N|  M/N|
    |Value 10|Value 10|Value 10|Value 10|     N|     M|  M/N|
    |Value 11|Value 11|Value 11|Value 11|     S|     F|  F/S|
    |Value 12|Value 12|Value 12|Value 12|     K|     N|  K/N|
    |Value 13|Value 13|Value 13|Value 13|     K|     N|  K/N|
    |Value 14|Value 14|Value 14|Value 14|     N|     K|  K/N|
    |Value 15|Value 15|Value 15|Value 15|    RE|    RA|RA/RE|
    |Value 16|Value 16|Value 16|Value 16|    RA|    RE|RA/RE|
    +--------+--------+--------+--------+------+------+-----+
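
  • Since the question mentions a downstream groupBy on the new name column, here is a minimal sketch of how that step might look once the names are normalized; the count() aggregation is only a placeholder for whatever aggregation is actually needed:

    from pyspark.sql.functions import array, concat_ws, sort_array

    # Rows with (A, B) and (B, A) now land in the same group
    result = (df.withColumn("name", concat_ws("/", sort_array(array("side 1", "side 2"))))
                .groupBy("name")
                .count())
    result.show()

  • As a design note, for exactly two columns the least and greatest functions from pyspark.sql.functions would work as well, e.g. concat_ws("/", least("side 1", "side 2"), greatest("side 1", "side 2")), but sort_array generalizes to any number of side columns.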