I have the following column in a DataFrame:
tuff,1,2,3,bp123,5,6,7,jatin gupta ,ext,20021988
I need to use a regular expression in PySpark to add an opening double quote after the 8th comma (if a double quote is not already present) and a closing double quote before the comma that precedes the trailing digits 20021988.
Expected output:
tuff,1,2,3,bp123,5,6,7,"jatin gupta ,ext",20021988
I have tried the pattern below, but it doesn't work:
from pyspark.sql.functions import col, regexp_replace

data = [("TUFF,2,3,BP4,5,6,7,JATIN GUPTA, EXT, 20021988",)]
columns = ["input_string"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Use regular expression to add quotes after the 7th comma and before the 8 digits
df = df.withColumn("output_string", regexp_replace(col("input_string"), r"((?:[^,]*,){7})([^,]*,[^,]*),([^,]*)", r'\1"\2", \3'))
This will do the trick:
import pyspark.sql.functions as F
df.withColumn(
    "res",
    F.regexp_replace(
        F.col("input_string"),
        r"(([^,]*,){7})(.*),([^,]*)",
        r"$1\"$3\",$4",
    )
)
Result:
+---------------------------------------------+-----------------------------------------------+
|input_string |res |
+---------------------------------------------+-----------------------------------------------+
|TUFF,2,3,BP4,5,6,7,JATIN GUPTA, EXT, 20021988|TUFF,2,3,BP4,5,6,7,"JATIN GUPTA, EXT", 20021988|
+---------------------------------------------+-----------------------------------------------+
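If you want to sanity-check the pattern without a Spark session, here is a minimal sketch using plain Python's `re` module. It assumes Python's regex engine treats this particular pattern the same way Spark's Java engine does. The helper name `quote_field` and its `leading_fields` parameter are made up for illustration. Note that Python's `re.sub` uses `\1`-style group references, whereas Spark's `regexp_replace` uses the Java-style `$1`.

```python
import re


def quote_field(value: str, leading_fields: int) -> str:
    """Hypothetical helper: wrap the text between the leading fields
    and the last field in double quotes."""
    # Group 1: the first `leading_fields` comma-terminated fields.
    # Group 3: everything up to (not including) the last comma.
    # Group 4: the final field (the trailing digits).
    pattern = r"(([^,]*,){%d})(.*),([^,]*)" % leading_fields
    return re.sub(pattern, r'\1"\3",\4', value)


# Sample from the code above: 7 fields come before the text to quote.
print(quote_field("TUFF,2,3,BP4,5,6,7,JATIN GUPTA, EXT, 20021988", 7))

# Sample from the question text: 8 fields come before the text to quote.
print(quote_field("tuff,1,2,3,bp123,5,6,7,jatin gupta ,ext,20021988", 8))
```

The two samples differ in how many fields come before the text to quote (7 vs. 8), so the repetition count is passed as a parameter. This sketch does not check whether the field is already quoted.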
A few notes: