Search code examples
pythonpysparkheadermultiple-columnsrename

PySpark remove string before a character from all column names


I have some column names in a dataset that have three underscores ___ in the string. Using PySpark, I would like to remove all characters before the underscores including the underscores, and keep the remaining characters as column names. I need the code to dynamically rename column names instead of writing column names in the code. If ___ is at the start or end of the column name, it should only remove ___ and leave remaining characters as it is.

Example:

Input column names:

sequence_number   
department  
user___first_name  
user___last_name  
phone___mobile1
___city  
state___
zip_code

Desired output column names:

sequence_number   
department  
first_name  
last_name  
mobile1
city  
state
zip_code

Solution

  • Try with this:

    import re
    
    def normalize(col):
        """removes *___ from beginning or end of column names"""
        col = col.rstrip("___")
        return re.sub(r'^(.*___)(.*)$', r'\2', col)
    
    # nozmalize column names in dataframe
    df = df.toDF(*[normalize(c) for c in df.columns])