Search code examples
pysparkdatabricks

Add quote for pyspark dataframe column with regular expressions


I have below column within dataframe

tuff,1,2,3,bp123,5,6,7,jatin gupta ,ext,20021988

I require to use regular expression within pyspark to add double quote after 8 comma ( if double quote not present already) and add ending double quote before digits 20021988

Expected output:

tuff,1,2,3,bp123,5,6,7,"jatin gupta ,ext",20021988

I have tried with below pattern but doesnt work

data = [("TUFF,2,3,BP4,5,6,7,JATIN GUPTA, EXT, 20021988",)]
columns = ["input_string"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Use regular expression to add quotes after the 7th comma and before the 8 digits
df = df.withColumn("output_string", regexp_replace(col("input_string"), r"((?:[^,]*,){7})([^,]*,[^,]*),([^,]*)", r'\1"\2", \3'))

Solution

  • This will do the trick:

    import pyspark.sql.functions as F
    df.withColumn(
        "res", 
        F.regexp_replace(
            F.col("input_string"),
            r"(([^,]*,){7})(.*),([^,]*)",
            r"$1\"$3\",$4",
        )
    )
    

    result:

    +---------------------------------------------+-----------------------------------------------+
    |input_string                                 |res                                            |
    +---------------------------------------------+-----------------------------------------------+
    |TUFF,2,3,BP4,5,6,7,JATIN GUPTA, EXT, 20021988|TUFF,2,3,BP4,5,6,7,"JATIN GUPTA, EXT", 20021988|
    +---------------------------------------------+-----------------------------------------------+
    

    Few notes:

    1. Group reference in spark is achieved through dollar sign, not slash (I can think of several systems where it's similar - worth generalizing, that it's not always slash)
    2. You have pretty complex definition, containing complete cell - I'd avoid lookahead and lookback regex operators - just define complete cell in groups, then work on individual groups.
    3. Utilize fact that regex is greedy on default - i.e. group to be included within quotes can be treated as whatever terminated with single comma, and whatever except comma.