My input is
'Apple-Pie-1024'
and the desired output is
'Apple-Pie'
In Python I can do this with the rsplit() method:
s.rsplit('-', 1)[0].strip()
which gives me the desired output. How can I get the same result with Spark SQL? Thanks in advance.
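For reference, the behaviour being replicated: rsplit with maxsplit=1 splits only at the last '-', and index [0] keeps everything before it.

```python
s = 'Apple-Pie-1024'

# maxsplit=1 from the right: ['Apple-Pie', '1024']; [0] keeps the left part
print(s.rsplit('-', 1)[0].strip())  # Apple-Pie
```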
You can achieve this with a UDF that wraps the Python snippet you already have:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

split_udf = F.udf(lambda x: x.rsplit('-', 1)[0].strip(), StringType())
input_list = [
    (1, "Apple-Pie-1024")
]

sparkDF = spark.createDataFrame(input_list, ['num', 'col1'])
sparkDF = sparkDF.withColumn('r_split', split_udf(F.col('col1')))
sparkDF.show()
+---+--------------+---------+
|num| col1| r_split|
+---+--------------+---------+
| 1|Apple-Pie-1024|Apple-Pie|
+---+--------------+---------+
Alternatively, with built-in functions (note: this variant assumes the value has exactly two segments before the trailing number):

split_col = F.split(F.col('col1'), '-')
sparkDF = sparkDF.withColumn(
    'r_split',
    F.concat(split_col.getItem(0), F.lit('-'), split_col.getItem(1))
)
sparkDF.show()
+---+--------------+---------+
|num| col1| r_split|
+---+--------------+---------+
| 1|Apple-Pie-1024|Apple-Pie|
+---+--------------+---------+
Built-in SQL functions are preferred (see also here) because with a UDF every row has to be serialized and passed back and forth between the JVM process and the Python worker process.
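A more general built-in equivalent of rsplit('-', 1)[0] is to strip the final hyphen-delimited token with a regular expression, which works for any number of hyphens (not just two segments). The pattern can be checked in plain Python first; a minimal sketch:

```python
import re

def drop_last_token(s: str) -> str:
    # Remove the final '-'-delimited token, mimicking s.rsplit('-', 1)[0].
    # If there is no '-', the pattern does not match and s is returned unchanged.
    return re.sub(r'-[^-]*$', '', s).strip()

print(drop_last_token('Apple-Pie-1024'))       # Apple-Pie
print(drop_last_token('Cherry-Apple-Pie-42'))  # Cherry-Apple-Pie
```

In Spark the same pattern runs entirely on the JVM via sparkDF.withColumn('r_split', F.regexp_replace(F.col('col1'), '-[^-]*$', '')), so no UDF is needed.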
This solution is intended to show a way of tackling data inconsistency within the Spark environment. You can update split_udf to handle edge cases, as below:

import re

# Regex UDF that keeps only the alphabetic tokens of the string, joined by '-'
split_udf = F.udf(lambda x: "-".join(re.findall("[a-zA-Z]+", x)), StringType())
input_list = [
    (1, "Apple-Pie-1024"),
    (2, "ApplePie-12345")
]

sparkDF = spark.createDataFrame(input_list, ['num', 'col1'])
sparkDF = sparkDF.withColumn('r_split', split_udf(F.col('col1')))
sparkDF.show()
+---+--------------+---------+
|num| col1| r_split|
+---+--------------+---------+
| 1|Apple-Pie-1024|Apple-Pie|
| 2|ApplePie-12345| ApplePie|
+---+--------------+---------+
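The lambda inside the updated split_udf can be sanity-checked outside Spark on both example rows; a standalone version of the same logic:

```python
import re

def keep_alpha_tokens(s: str) -> str:
    # Join every run of letters with '-', dropping digits and stray separators.
    return "-".join(re.findall("[a-zA-Z]+", s))

print(keep_alpha_tokens("Apple-Pie-1024"))  # Apple-Pie
print(keep_alpha_tokens("ApplePie-12345"))  # ApplePie
```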