Search code examples
pythonpandasrenamestrip

Regular expression to rename the column by stripping the column name


I have df with many columns and each column have repeated values because its survey data. As an example my data look like this:

df:

 Q36r9: sales platforms - Before purchasing a new car         Q36r32: Advertising letters - Before purchasing a new car
        Not Selected                                                                         Selected

So i want to strip the text from column names. For example from first column I want to get the text between ":" and "-". So it should be like this: "sales platform" and in second part i want to convert vales of column, "selected" should be changed with the name of column and "Not Selected" as NaN

so desired output would be like this:

sales platforms                                       Advertising letters
      NaN                                             Advertising letters

Edited: Another Problem if i have column name like:

Q40r1c3: WeChat - Looking for a new car - And now if you think again  - Which social media platforms or sources would you use in each situation?

If i just want to get something in between ":" and "-". It should extract "WeChat"


Solution

  • IIUC,

    we can take advantage of some regex and greed matching using .* which matches everything between a defined pattern

    import re
    
    df.columns = [re.search(':(.*)-',i).group(1) for i in df.columns.str.strip()]
    
    print(df.columns)
    
       sales platforms   Advertising letters 
    0      Not Selected                  None
    

    Edit:

    with greedy matching we can use +?

    +? Quantifier — Matches between one and unlimited times, as few times as possible, expanding as needed (lazy)
    

    Q36r9: sales platforms - Before purchasing a new car    Q40r1c3: WeChat - Looking for a new car - And now if you think again - Which social media platforms or sources would you use in each situation?
    0                                                       1
    
    
    import re
    
    [re.search(':(.+?)-',i).group(1).strip() for i in df.columns]
    
    ['sales platforms', 'WeChat']