I have a data with many similar column names (basically mis-spell words), for example:
apple grapes apples bana apyles grayes graph banana
Here, I want to combine the columns 'apple, apples, apyles', then 'grapes, grayes, graph', and 'bana, banana'. How can I do this?
*Edited for comment:
Q. What do you mean when you say “combine”. Can you include sample input and output?
Ans.
Input
apple grapes apples bana apyles grayes graph banana
1 2 3 4 5 6 7 8
Output
apple grape banana
9 15 12
With fuzzywuzzy
you can try something as follows. Please note that the optimal fuzz.ratio
that I could use to get this to work is 70
:
import pandas as pd
from fuzzywuzzy import fuzz
l = []
correct = ['apple', 'grapes', 'banana']
cols = df.columns[df.columns.isin(correct)]
for col in cols:
l.append([c for c in df.columns if fuzz.ratio(col,c) > 70])
df = df.T.reset_index()
for i in range(len(correct)):
for j in l[i]:
df['index'] = df['index'].replace(j, correct[i])
df = df.groupby('index').sum().T
df
Out[1]:
index apple banana grapes
0 9 12 15