
PySpark regular expression: add double quotes after a comma


I have the following string in a DataFrame column:

30,kUsUO,6,18,97,42,SAM,lmhYK,49,aLaTA,51,34,3,49,75,39,pdwvW,54,7,63,12,25,26,SJ12u,rUFUV,34,xXBv3,XHtz4,r4Fyh,14,20,0jZL2,izrsC,44,K5Kw3,8,tcKu7,5,RPLcy,kg4IR,Kvs3p,lyG09,dJmZB,34,84,7,qED2y,8uNen,5,96,81,88,bGgqK,FAsIV,81,YXZ,PQR,Flat No B1002, Balaji Whitefield society, sus road, pune,Mh,22,591213,LbAo7,21,18,text,,,,,

The requirement is to add an opening double quote after the 57th comma when a string or digits immediately follow it, and a closing double quote before the pattern digits,digits (here, before ,22,591213).

So basically I am trying to enclose the following substring in double quotes:
"Flat No B1002, Balaji Whitefield society, sus road, pune,Mh"

For that I have written the regular expression below:

from pyspark.sql.functions import col, regexp_replace

pattern = r"^((?:[^,]*,){57})(\"?[a-zA-Z_][^\"]*?\"?)(,\d{2},\d{4}.*)$"

df = df.withColumn("text", regexp_replace(col("text"), pattern, r'$1"$2"$3'))

This regular expression works well for the string above.

But if the string varies, as in the example below, the comma count goes wrong:

30,kUsUO,6,18,97,42,"SAM,K,KARAN" lmhYK,49,aLaTA,51,34,3,49,75,39,pdwvW,54,7,63,12,25,26,SJ12u,rUFUV,34,xXBv3,XHtz4,r4Fyh,14,20,0jZL2,izrsC,44,K5Kw3,8,tcKu7,5,RPLcy,kg4IR,Kvs3p,lyG09,dJmZB,34,84,7,qED2y,8uNen,5,96,81,88,bGgqK,FAsIV,81,YXZ,PQR,Flat No B1002, Balaji Whitefield society, sus road, pune,Mh,22,591213,LbAo7,21,18,text,,,,,

Here the name appears in double quotes with commas inside it ("SAM,K,KARAN"), and this throws off my comma count.

Is there any way to modify the regular expression above in PySpark so that it ignores commas that appear inside double quotes?

Such quoted values containing commas can appear any number of times and at any position.


Solution

  • You might change the regex to:

    ^((?:[^,"]*(?:"[^"]*"[^,"]*)*,){57})("?[a-zA-Z_][^"]*"?)(,\d{2},\d{4}.*)$
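
The only part that changes relative to the pattern in the question is the repeated field expression: `[^,]*,` becomes `[^,"]*(?:"[^"]*"[^,"]*)*,`. A minimal sketch of the difference, using Python's `re` module rather than Spark and a short made-up row (not the question's data):

```python
import re

# Field expression from the question: every comma ends a field.
NAIVE_FIELD = r'[^,]*,'
# Field expression from the answer: commas inside "..." do not end a field.
QUOTED_FIELD = r'[^,"]*(?:"[^"]*"[^,"]*)*,'

# Hypothetical short row for illustration only.
row = '42,"SAM,K,KARAN",lmhYK,49,'

naive_fields = re.findall(NAIVE_FIELD, row)
quoted_fields = re.findall(QUOTED_FIELD, row)

print(len(naive_fields), naive_fields)    # the quoted name is split into three pieces
print(len(quoted_fields), quoted_fields)  # the quoted name stays in one field
```

With the naive expression the quoted name counts as three fields, which is why `{57}` stops too early; the quote-aware expression counts it as one.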
    

    The different groups match:

    • ^ Start of string
    • ( Capture group 1
    • (?: Non capture group to repeat as a whole
      • [^,"]* Match optional chars other than , or "
      • (?:"[^"]*"[^,"]*)*, Optionally repeat matching "..." followed by optional chars other than , or " and then match a ,
    • ){57} Close the non capture group and repeat 57 times
    • ) Close group 1
    • ( Capture group 2
      • "? Match an optional "
      • [a-zA-Z_] Match a single ASCII letter or underscore
      • [^"]*"? Match optional chars other than " and then match an optional "
    • ) Close group 2
    • ( Capture group 3
      • ,\d{2},\d{4}.* Match a comma, 2 digits, comma, 4 digits and then the rest of the line
    • ) Close group 3
    • $ End of string
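
The pattern can be checked locally before it goes into a Spark job. The sketch below uses Python's `re` module, on the assumption that the constructs used here (non-capturing groups, character classes, `\d`, `{n}`) behave the same way in the Java regex engine that Spark uses. `build_pattern` and `quote_address` are hypothetical helper names, and the second test row assumes the quoted name is still followed by a comma.

```python
import re


def build_pattern(n_commas: int = 57) -> str:
    """Build the answer's regex with a configurable comma count (hypothetical helper)."""
    # One field: unquoted chars, optionally mixed with "..." runs, then a comma.
    field = r'[^,"]*(?:"[^"]*"[^,"]*)*,'
    prefix = r'^((?:' + field + r'){' + str(n_commas) + r'})'
    target = r'("?[a-zA-Z_][^"]*"?)'
    suffix = r'(,\d{2},\d{4}.*)$'
    return prefix + target + suffix


def quote_address(line: str, n_commas: int = 57) -> str:
    # Python writes back-references as \1; Spark's regexp_replace uses $1.
    return re.sub(build_pattern(n_commas), r'\1"\2"\3', line)


# First sample row from the question.
plain = (
    "30,kUsUO,6,18,97,42,SAM,lmhYK,49,aLaTA,51,34,3,49,75,39,pdwvW,54,7,63,"
    "12,25,26,SJ12u,rUFUV,34,xXBv3,XHtz4,r4Fyh,14,20,0jZL2,izrsC,44,K5Kw3,8,"
    "tcKu7,5,RPLcy,kg4IR,Kvs3p,lyG09,dJmZB,34,84,7,qED2y,8uNen,5,96,81,88,"
    "bGgqK,FAsIV,81,YXZ,PQR,Flat No B1002, Balaji Whitefield society, "
    "sus road, pune,Mh,22,591213,LbAo7,21,18,text,,,,,"
)

# Assumed variation: the name field is quoted and contains commas.
quoted_name = plain.replace(',SAM,', ',"SAM,K,KARAN",', 1)

print(quote_address(plain))
print(quote_address(quoted_name))
```

In PySpark the same pattern string would be passed to `regexp_replace(col("text"), pattern, '$1"$2"$3')`, as in the question. Rows that do not match the pattern are returned unchanged.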
