pyspark, johnsnowlabs-spark-nlp

Remove the repeated punctuation from pyspark dataframe


I need to collapse repeated punctuation characters so that each run keeps only a single occurrence.

For example: !!!! -> !
             !!$$ -> !$

I have a dataset that looks like this:

temp = spark.createDataFrame([
    (0, "This is Spark!!!!"),
    (1, "I wish Java could use case classes!!##"),
    (2, "Data science is  cool#$@!"),
    (3, "Machine!!$$")
], ["id", "words"])

+---+--------------------------------------+
|id |words                                 |
+---+--------------------------------------+
|0  |This is Spark!!!!                     |
|1  |I wish Java could use case classes!!##|
|2  |Data science is  cool#$@!             |
|3  |Machine!!$$                           |
+---+--------------------------------------+

I tried a regex that removes specific punctuation characters:

from pyspark.sql import functions as F

df2 = temp.select(
    [F.regexp_replace(col, r',|\.|&|\\|\||-|_', '').alias(col) for col in temp.columns]
)

But the above does not work. Can anyone tell me how to achieve this in PySpark?

Below is the desired output.

    id  words
0   0   This is Spark!
1   1   I wish Java could use case classes!#
2   2   Data science is cool#$@!
3   3   Machine!$

Solution

  • You can use this regex.

    from pyspark.sql import functions as F

    df2 = temp.select('id',
        F.regexp_replace('words', r'([!$#])\1+', '$1').alias('words'))
    

    Regex explanation.

    (   -> Start a capturing group that runs to the matching )
    [   -> Start a character class that runs to the matching ]; it matches any one of the characters listed inside
    
    ([!$#]) -> Capture a single character that is one of !, $, #
    
    \1  -> Backreference: match the same text that the first capturing group matched
    +   -> Match one or more of the preceding group or character
    
    ([!$#])\1+ -> Match a run of two or more of the same character among !, $, #
    

    The last argument of regexp_replace is $1, which references the first capturing group (a single !, $, or #), so each run of repeated characters is replaced with one copy of that character.

    To handle other special characters, add them between the [ and ].
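As a rough way to try the pattern with a wider character set before running it on a cluster, the sketch below rebuilds the same regex in plain Python and checks it against the rows from the question. The helper names `repeat_pattern` and `collapse_repeats` and the extra characters `@?.` are illustrative assumptions, not part of the answer. It only exercises the regex logic locally with Python's `re` module. Spark uses Java regex, where the replacement is written `$1` instead of `\1`.

```python
import re

def repeat_pattern(chars):
    """Build the answer's pattern for an arbitrary set of characters.

    re.escape protects characters such as ], ^ or - that are special
    inside a character class.
    """
    return r'([' + re.escape(chars) + r'])\1+'

# Hypothetical wider set than the answer's !, $, #.
pattern = repeat_pattern('!$#@?.')

def collapse_repeats(text):
    # Python writes the backreference in the replacement as \1;
    # Spark's regexp_replace (Java regex) writes it as $1.
    return re.sub(pattern, r'\1', text)

samples = [
    "This is Spark!!!!",
    "I wish Java could use case classes!!##",
    "Data science is  cool#$@!",
    "Machine!!$$",
]
for s in samples:
    print(repr(s), '->', repr(collapse_repeats(s)))
```

A run is collapsed only when the same character repeats, so a mixed sequence such as `#$@!` is left untouched. If the generated pattern is passed to `F.regexp_replace` with `'$1'` as the replacement, it is worth confirming on a small DataFrame first, since the escaping produced by `re.escape` is assumed here to be acceptable to Java's regex engine.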