Search code examples
python-3.xpandasdataframerenameregex-group

Using Regex groups to rename columns in a pandas dataframe by matching multiple patterns at a time


I have the following list of columns of a data frame df cols_2 = ['state', 'population', 'male population', 'female population', 'working population', 'male working population', 'female working population' 'female population in the age group 0 to 6 years', 'male population in the age group 0 to 6 years', 'population in the age group 0 to 6 years']

For natural reasons I would like to compress names as follows: Any occurrences of population to pop, male to m_, female to f_, working to w and in the age group 0 to 6 years to _minor

Please note the spaces being included in the pattern

This Stack overflow Discussion is the starting point, where the requirement is only getting rid off square brackets by matching to a single pattern.

My aim is to obtain multiple matches for multiple patterns

Really appreciate any kind of help!

PS: This is my first time here!


Solution

  • You can try:

    import re
    
    new_cols = []
    
    pat = re.compile(r"(?:fe)?male\s*|population|working\s*")
    for col in cols_2:
        new_col = pat.sub(
            lambda g: f"{g[0][0]}_" if g[0][0] in "fmw" else f"{g[0][:3]}", col
        )
        new_col = new_col.replace(" in the age group 0 to 6 years", "_minor")
        new_cols.append(new_col)
    
    print(new_cols)
    
    # to replace the columns in DataFrame then:
    # df.columns = new_cols
    

    Prints:

    [
        "state",
        "pop",
        "m_pop",
        "f_pop",
        "w_pop",
        "m_w_pop",
        "f_w_pop",
        "f_pop_minor",
        "m_pop_minor",
        "pop_minor",
    ]