Search code examples
pythonpandasfuzzywuzzy

How to combine data columns with similar column names Pandas


I have a data with many similar column names (basically mis-spell words), for example:

apple    grapes    apples    bana    apyles    grayes    graph    banana

Here, I want to combine the columns 'apple, apples, apyles', then 'grapes, grayes, graph', and 'bana, banana'. How can I do this?

*Edited for comment:

Q. What do you mean when you say “combine”. Can you include sample input and output?

Ans.

Input

apple    grapes    apples    bana    apyles    grayes    graph    banana
  1         2         3        4        5         6        7         8

Output

apple    grape    banana
  9       15         12 

Solution

  • With fuzzywuzzy you can try something as follows. Please note that the optimal fuzz.ratio that I could use to get this to work is 70:

    import pandas as pd
    from fuzzywuzzy import fuzz
    l = []
    correct = ['apple', 'grapes', 'banana']
    cols = df.columns[df.columns.isin(correct)]
    for col in cols:
        l.append([c for c in df.columns if fuzz.ratio(col,c) > 70])
    df = df.T.reset_index()
    for i in range(len(correct)):
        for j in l[i]:
            df['index'] = df['index'].replace(j, correct[i])
    df = df.groupby('index').sum().T
    df
    Out[1]: 
    index  apple  banana  grapes
    0          9      12      15