python, apache-spark, apache-spark-sql

How to perform Python's rsplit() in Spark SQL, i.e. split on the last occurrence of a delimiter?


My input is

'Apple-Pie-1024'

Desired output is

'Apple-Pie'

I can do this in Python using the rsplit() method:

s.rsplit('-',1)[0].strip()

which gives me the desired output.

How can I get the same output with Spark SQL? Thanks in advance.


Solution

  • You can achieve this with a UDF that wraps the Python snippet you already have:

    UDF

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    from pyspark.sql.types import StringType
    
    sql = SparkSession.builder.getOrCreate()
    
    # UDF that applies Python's rsplit() to each value
    split_udf = F.udf(lambda x: x.rsplit('-', 1)[0].strip(), StringType())
    
    input_list = [
      (1,"Apple-Pie-1024")
    ]
    
    sparkDF = sql.createDataFrame(input_list,['num','col1'])
    
    sparkDF = sparkDF.withColumn('r_split',split_udf(F.col('col1')))
    
    sparkDF.show()
    
    +---+--------------+---------+
    |num|          col1|  r_split|
    +---+--------------+---------+
    |  1|Apple-Pie-1024|Apple-Pie|
    +---+--------------+---------+
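
    One caveat worth noting (my addition, not part of the original answer): the lambda above raises a TypeError when col1 is null. A guarded variant is a minimal fix:

    # Null-safe variant of the UDF above: pass nulls through instead
    # of calling rsplit() on None
    split_udf = F.udf(
        lambda x: x.rsplit('-', 1)[0].strip() if x is not None else None,
        StringType()
    )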
    

    Built-in Spark SQL -

    # Rebuild the part before the last '-' by re-joining the leading
    # segments; note this assumes every value splits into exactly
    # three parts ('Apple', 'Pie', '1024')
    sparkDF = sparkDF.withColumn(
        'r_split',
        F.concat(
            F.split(F.col('col1'), '-').getItem(0),
            F.lit('-'),
            F.split(F.col('col1'), '-').getItem(1)
        )
    )
    
    sparkDF.show()
    
    +---+--------------+---------+
    |num|          col1|  r_split|
    +---+--------------+---------+
    |  1|Apple-Pie-1024|Apple-Pie|
    +---+--------------+---------+
    
    

    Built-in SQL functions are preferred because your data does not get passed back and forth between the JVM process and the Python process, which is what happens when you use a UDF.
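
    A fully built-in equivalent of rsplit('-', 1)[0] itself (my sketch, not part of the original answer) is to delete the last '-' and everything after it with regexp_replace, which works for any number of segments:

    # Sketch: mirror rsplit('-', 1)[0] with a regex that strips the
    # final '-<segment>', independent of how many '-' the value has
    sparkDF = sparkDF.withColumn(
        'r_split',
        F.regexp_replace(F.col('col1'), '-[^-]*$', '')
    )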

    Quick Fix for the Desired Output

    This variant shows one way to tackle inconsistent data and how to approach it within the Spark environment.

    You can update split_udf to handle such edge cases as below -

    import re
    
    # REGEX to keep only the alphabetic segments of the string and
    # re-join them with '-'
    split_udf = F.udf(lambda x: "-".join(re.findall("[a-zA-Z]+", x)), StringType())
    
    input_list = [
      (1,"Apple-Pie-1024")
     ,(2,"ApplePie-12345")
    ]
    
    sparkDF = sql.createDataFrame(input_list,['num','col1'])
    
    sparkDF = sparkDF.withColumn('r_split',split_udf(F.col('col1')))
    
    sparkDF.show()
    
    +---+--------------+---------+
    |num|          col1|  r_split|
    +---+--------------+---------+
    |  1|Apple-Pie-1024|Apple-Pie|
    |  2|ApplePie-12345| ApplePie|
    +---+--------------+---------+
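
    Since the question asks for Spark SQL specifically, here is a hedged sketch of the same cleanup as a plain SQL statement, using the built-in regexp_replace instead of a Python UDF (this assumes the trailing segment is always numeric; the view name is illustrative):

    # Sketch: register the DataFrame as a view and strip the trailing
    # numeric segment with SQL's built-in regexp_replace
    sparkDF.createOrReplaceTempView('fruits')

    sql.sql(
        "SELECT num, col1, regexp_replace(col1, '-[0-9]+$', '') AS r_split FROM fruits"
    ).show()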