I need to remove repeated punctuation characters and keep only a single occurrence of each.
For example: !!!! -> !
!!$$ -> !$
I have a dataset that looks like this:
temp = spark.createDataFrame([
(0, "This is Spark!!!!"),
(1, "I wish Java could use case classes!!##"),
(2, "Data science is cool#$@!"),
(3, "Machine!!$$")
], ["id", "words"])
+---+--------------------------------------+
|id |words                                 |
+---+--------------------------------------+
|0  |This is Spark!!!!                     |
|1  |I wish Java could use case classes!!##|
|2  |Data science is cool#$@!              |
|3  |Machine!!$$                           |
+---+--------------------------------------+
I tried using a regex to remove specific punctuation characters:
df2 = temp.select(
[F.regexp_replace(col, r',|\.|&|\\|\||-|_', '').alias(col) for col in temp.columns]
)
but this does not work. Can anyone tell me how to achieve this in PySpark?
Below is the desired output.
id words
0 0 This is Spark!
1 1 I wish Java could use case classes!#
2 2 Data science is cool#$@!
3 3 Machine!$
You can use this regex.
from pyspark.sql import functions as F

df2 = temp.select('id',
    F.regexp_replace('words', r'([!$#])\1+', '$1').alias('words'))
Regex explanation:
( -> Start a capturing group that runs up to the matching )
[ -> Start a character class that matches any single character listed before the ]
([!$#]) -> A capturing group that matches any one of !, $, #
\1 -> Backreference to whatever the first capturing group matched
+ -> Match one or more of the preceding group or character
([!$#])\1+ -> Match any of !, $, # followed by one or more repeats of that same character (a run of two or more).
The last argument of regexp_replace is $1, which references the first capturing group (a single !, $, or #), so each run of repeated characters is replaced with just one copy of that character.
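If you want to check the pattern without starting Spark, here is a minimal sketch using Python's built-in re module. The function name collapse_repeats is made up for illustration. Note one difference: Python's re writes the group reference in the replacement as \1, while Spark's regexp_replace (which uses Java regex) writes it as $1.

```python
import re

# Same pattern as in the answer: one of !, $, # followed by
# one or more copies of that same character.
PATTERN = re.compile(r'([!$#])\1+')

def collapse_repeats(text: str) -> str:
    # Replace each run with the single character captured by group 1.
    # Python's re uses \1 here; Spark's regexp_replace uses $1.
    return PATTERN.sub(r'\1', text)

samples = [
    "This is Spark!!!!",
    "I wish Java could use case classes!!##",
    "Data science is cool#$@!",
    "Machine!!$$",
]

for s in samples:
    print(collapse_repeats(s))
```

The third sample comes back unchanged because none of its punctuation characters is immediately repeated.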
You can add more characters between [ and ] to match more special characters.
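As a rough sketch of that idea, you could build the character class from string.punctuation instead of typing each character by hand. This is plain Python using the re module, and the name collapse_any_punct is hypothetical. The escaped class should in principle also be usable as the pattern for regexp_replace, but that is an assumption not tested here, since Spark uses Java's regex engine.

```python
import re
import string

# Character class covering every ASCII punctuation character.
# re.escape protects characters such as ], \, ^ and - that are
# special inside a character class.
PUNCT_CLASS = '[' + re.escape(string.punctuation) + ']'

# Capture one punctuation character, then require one or more
# repeats of that same character.
PUNCT_PATTERN = re.compile('(' + PUNCT_CLASS + r')\1+')

def collapse_any_punct(text: str) -> str:
    # Keep a single copy of each repeated punctuation character.
    return PUNCT_PATTERN.sub(r'\1', text)

print(collapse_any_punct("Wait...what??"))
print(collapse_any_punct("Machine!!$$"))
```

Only runs of the same character are collapsed, so a mixed sequence such as #$@! is left as it is.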