I have some column names in a dataset that have three underscores ___ in the string. Using PySpark, I would like to remove all characters before the underscores including the underscores, and keep the remaining characters as column names. I need the code to dynamically rename column names instead of writing column names in the code. If ___ is at the start or end of the column name, it should only remove ___ and leave remaining characters as it is.
Example:
Input column names:
sequence_number
department
user___first_name
user___last_name
phone___mobile1
___city
state___
zip_code
Desired output column names:
sequence_number
department
first_name
last_name
mobile1
city
state
zip_code
Try with this:
import re
def normalize(col):
"""removes *___ from beginning or end of column names"""
col = col.rstrip("___")
return re.sub(r'^(.*___)(.*)$', r'\2', col)
# nozmalize column names in dataframe
df = df.toDF(*[normalize(c) for c in df.columns])