apache-spark pyspark apache-spark-sql nlp punctuation

How to remove punctuation from a text?


I have a very large dataset. How can I remove all punctuation from it in PySpark? For example: , . & \ | - _


Solution

  • You can use regexp_replace with a regular expression to remove the punctuation characters you listed:

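    import pyspark.sql.functions as F

    # Match any one of: , . & \ | - _  (backslash, dot, and pipe are escaped)
    pattern = r',|\.|&|\\|\||-|_'

    # Apply the replacement to every column, keeping the original column names
    df2 = df.select(
        [F.regexp_replace(col, pattern, '').alias(col) for col in df.columns]
    )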
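
A minimal sketch of a variation on the same idea, assuming you would rather list the characters than hand-write the regex, and that only string columns should be cleaned. The names `CHARS`, `build_pattern`, and `strip_chars` are hypothetical helpers, not part of PySpark. `re.escape` comes from the Python standard library, and the sketch assumes the escapes it produces are also accepted by the Java regex engine that Spark uses, which should hold for these punctuation characters:

```python
import re

# Characters listed in the question: , . & \ | - _
CHARS = [",", ".", "&", "\\", "|", "-", "_"]


def build_pattern(chars):
    """Return a regex character class matching any one of `chars`."""
    # re.escape backslash-escapes regex metacharacters such as . \ | and -
    return "[" + "".join(re.escape(c) for c in chars) + "]"


def strip_chars(df, chars=CHARS):
    """Remove `chars` from every string column of a Spark DataFrame (hypothetical helper)."""
    # Imported lazily so build_pattern can be tried without a Spark install.
    import pyspark.sql.functions as F
    from pyspark.sql.types import StringType

    pattern = build_pattern(chars)
    string_cols = {
        f.name for f in df.schema.fields if isinstance(f.dataType, StringType)
    }
    # Clean string columns; pass every other column through unchanged.
    return df.select(
        [
            F.regexp_replace(F.col(c), pattern, "").alias(c)
            if c in string_cols
            else F.col(c)
            for c in df.columns
        ]
    )
```

With a DataFrame `df` as in the question, `df2 = strip_chars(df)` should give the same result for string columns as the snippet above, while leaving numeric and other non-string columns with their original types.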