I have a very large dataset. How can I remove all punctuation from it in PySpark? For example: , . & \ | - _
You can use regexp_replace to strip the punctuation characters you listed with a regular expression:
import pyspark.sql.functions as F

# Replace every occurrence of , . & \ | - _ with an empty string.
# Inside a character class only the backslash needs escaping; the
# hyphen is literal when placed last. Note regexp_replace expects
# string columns, so cast non-string columns first if needed.
df2 = df.select(
    [F.regexp_replace(col, r'[,.&\\|_-]', '').alias(col) for col in df.columns]
)
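If you want to sanity-check the pattern outside Spark, the same character class can be exercised with Python's re module. Spark's regexp_replace uses Java's regex engine, but for a simple character class like this the behavior is the same; strip_punct below is just an illustrative helper, not part of any Spark API:

```python
import re

# Same character class as in regexp_replace above.
PUNCT = r'[,.&\\|_-]'

def strip_punct(value: str) -> str:
    """Remove the listed punctuation characters from a single string."""
    return re.sub(PUNCT, '', value)

print(strip_punct('a,b.c&d\\e|f-g_h'))  # → abcdefgh
```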