I have a pyspark data frame which looks like the sample below, ignoring the column called name, it is what I need to create out of columns side 1 and side 2. The issue is that the name needs to be the same regardless of whether side 1 contains A and side 2 B or side 1 B and side 2 A; the final result needs to be A/B so simply concating the two columns does not work because the names would look different A/B - B/A. I need to do a groupby further down the line using the name column which is why I need the names to be reflected correctly. Could someone please point me to a way I can use to do this.
df = pd.DataFrame({"column1":['Value1','Value2','Value3','Value4','Value5','Value6','Value7','Value8','Value9','Value10','Value11','Value12','Value13','Value14','Value 15','Value16'],\
"column2":['Value1','Value2','Value3','Value4','Value5','Value6','Value7','Value8','Value9','Value10','Value11','Value12','Value13','Value14','Value 15','Value16'],\
"column3":['Value1','Value2','Value3','Value4','Value5','Value6','Value7','Value8','Value9','Value10','Value11','Value12','Value13','Value14','Value 15','Value16'],\
"column4":['Value1','Value2','Value3','Value4','Value5','Value6','Value7','Value8','Value9','Value10','Value11','Value12','Value13','Value14','Value 15','Value16'],\
"side1":['A','B','A','K','C','G','F','F','M','N','S','K','K','N','RE','RA'],\
"side2":['B','A','B','M','H','H','Y','Y','N','M','F','N','N','K','RA','RE']})
mySchema = StructType([StructField("column 1", StringType(), True), \
StructField("column 2", StringType(), True), StructField("column 3", StringType(), True), \
StructField("column 4", StringType(), True), StructField("side 1", StringType(), True), \
StructField("side 2", StringType(), True)])
df = spark.createDataFrame(df,schema=mySchema)
You can use sort_array
to alphabetically order the combination of side 1
and side 2
and then concatenate them.
from pyspark.sql.functions import *
df.withColumn("name", concat_ws("/", sort_array(array("side 1", "side 2")))).show()
+--------+--------+--------+--------+------+------+-----+
|column 1|column 2|column 3|column 4|side 1|side 2| name|
+--------+--------+--------+--------+------+------+-----+
| Value 1| Value 1| Value 1| Value 1| A| B| A/B|
| Value 2| Value 2| Value 2| Value 2| B| A| A/B|
| Value 3| Value 3| Value 3| Value 3| A| B| A/B|
| Value 4| Value 4| Value 4| Value 4| K| M| K/M|
| Value 5| Value 5| Value 5| Value 5| C| H| C/H|
| Value 6| Value 6| Value 6| Value 6| G| H| G/H|
| Value 7| Value 7| Value 7| Value 7| F| Y| F/Y|
| Value 8| Value 8| Value 8| Value 8| F| Y| F/Y|
| Value 9| Value 9| Value 9| Value 9| M| N| M/N|
|Value 10|Value 10|Value 10|Value 10| N| M| M/N|
|Value 11|Value 11|Value 11|Value 11| S| F| F/S|
|Value 12|Value 12|Value 12|Value 12| K| N| K/N|
|Value 13|Value 13|Value 13|Value 13| K| N| K/N|
|Value 14|Value 14|Value 14|Value 14| N| K| K/N|
|Value 15|Value 15|Value 15|Value 15| RE| RA|RA/RE|
|Value 16|Value 16|Value 16|Value 16| RA| RE|RA/RE|
+--------+--------+--------+--------+------+------+-----+