Tags: python, dataframe, apache-spark, pyspark, apache-spark-sql

Insert a character at specific character counts in a string in PySpark


I'm looking for a way to insert a special character at specific character counts in a string in PySpark:

"M202876QC0581AADMM01"
to
"M-202876-QC0581-AA-DMM01"

(segment widths 1-6-6-2)
That is, insert a hyphen after the first character, then after the next 6 characters, then after the next 6, then after the next 2.

I tried something like the code below, but it doesn't produce the expected result:

df = spark.createDataFrame([('M202876QC0581AADMM01',)], ['str'])
(df.withColumn("str", F.regexp_replace(F.col("str") ,  r"(\d{0})(\d{3})(\d{3})" , "$1-$2-$3"))).collect()

Out[121]: [Row(str='M-202-876QC0581AADMM01')]

Solution

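  • You're close. \d matches only digits, but your string mixes letters and digits, so the pattern matched just the first run of six digits. Use . (any character) instead, with one fixed-width capture group per segment and a final group for the remainder: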

    from pyspark.sql.functions import regexp_replace
    
    df = spark.createDataFrame([("M202876QC0581AADMM01",)], ["str"])
    
    pat = r"^(.{1})(.{6})(.{6})(.{2})(.+)"
    df = df.withColumn("str", regexp_replace("str", pat, r"$1-$2-$3-$4-$5"))
    

    Output:

    df.show(truncate=False)
    
    +------------------------+
    |str                     |
    +------------------------+
    |M-202876-QC0581-AA-DMM01|
    +------------------------+
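
If the segment widths change, the pattern and replacement strings above can be generated rather than typed by hand. The following is a minimal sketch, not part of the original answer: the helper name build_insert_pattern and its parameters are hypothetical, and it assumes the separator is plain text with no $ or backslash in it. The last lines check the result locally with Python's re module, which writes group references as \g<1> instead of the $1 form that Spark's regexp_replace expects.

```python
import re


def build_insert_pattern(widths, sep="-"):
    """Hypothetical helper: build (pattern, replacement) for the given segment widths."""
    # One fixed-width capture group per segment, then a final group for the remainder.
    groups = "".join(f"(.{{{w}}})" for w in widths)
    pattern = f"^{groups}(.+)"
    # Spark's regexp_replace uses Java regex syntax, where groups are referenced as $1, $2, ...
    replacement = sep.join(f"${i}" for i in range(1, len(widths) + 2))
    return pattern, replacement


pattern, replacement = build_insert_pattern([1, 6, 6, 2])
print(pattern)      # ^(.{1})(.{6})(.{6})(.{2})(.+)
print(replacement)  # $1-$2-$3-$4-$5

# Local sanity check without Spark: convert $n to Python's \g<n> group syntax.
py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
result = re.sub(pattern, py_replacement, "M202876QC0581AADMM01")
print(result)       # M-202876-QC0581-AA-DMM01
```

The generated strings can be passed straight to regexp_replace("str", pattern, replacement) as in the answer above. Note that a value with fewer characters than the widths require (here, fewer than 16) does not match the pattern and is left unchanged.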