Tags: apache-spark, pyspark, apache-spark-sql, special-characters, str-replace

PySpark remove special characters in all column names for all special characters


I am trying to remove all special characters from all the column names. I am using the following commands:

import pyspark.sql.functions as F

df_spark = spark_df.select([F.col(col).alias(col.replace(' ', '_')) for col in spark_df.columns])
df_spark1 = df_spark.select([F.col(col).alias(col.replace('%', '_')) for col in df_spark.columns])
df_spark = df_spark1.select([F.col(col).alias(col.replace(',', '_')) for col in df_spark1.columns])
df_spark1 = df_spark.select([F.col(col).alias(col.replace('(', '_')) for col in df_spark.columns])
df_spark2 = df_spark1.select([F.col(col).alias(col.replace(')', '_')) for col in df_spark1.columns])

Is there an easier way of replacing all special characters (not just the 5 above) in just one command? I am using PySpark on Databricks.


Solution

  • You can strip every character that is not a digit (0-9), a letter (a-z, A-Z), or $ in a single pass with re.sub:

    import pyspark.sql.functions as F
    import re
    
    # Rename every column, dropping each run of characters outside 0-9, a-z, A-Z and $
    df = df.select([F.col(column_name).alias(re.sub("[^0-9a-zA-Z$]+", "", column_name)) for column_name in df.columns])
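
  • If you want to keep the original intent of replacing special characters with underscores rather than dropping them, the same one-liner works with "_" as the replacement string. This is a minimal sketch; the DataFrame name df and the column names shown in the comment are only illustrative.

    import pyspark.sql.functions as F
    import re
    
    # Collapse each run of non-alphanumeric characters in a column name into a single underscore,
    # e.g. "Total (USD)" -> "Total_USD_", "unit price, net" -> "unit_price_net"
    df = df.select([F.col(column_name).alias(re.sub("[^0-9a-zA-Z]+", "_", column_name)) for column_name in df.columns])

    Note that two different names can collapse to the same result (e.g. "a(b" and "a)b" both become "a_b"), so check for duplicate column names after renaming.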