Background
I have a df
import pandas as pd
import nltk
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df = pd.DataFrame({'ID': [1, 2, 3],
                   'Text': ['This num dogs and cats is (111)888-8780 and other',
                            'dont block cow 23 here',
                            'cat two num: dog and cows here']
                   })
I also have a list
word_list = ['dog', 'cat', 'cow']
and a function that is supposed to do fuzzy matching on the Text column of the df with the word_list
def fuzzy(row, word_list):
    tweet = row[0]
    fuzzy_match = []
    for word in word_list:
        token_words = nltk.word_tokenize(tweet)
        for token in range(0, len(token_words) - 1):
            fuzzy_fx = process.extract(word_list[word], token_words[token], limit=100, scorer = fuzz.ratio)
            fuzzy_match.append(fuzzy_fx[0])
    return pd.Series([fuzzy_match], index = ['Fuzzy_Match'])
I then join
df_fuzz = df.join(df.apply(lambda x: fuzzy(x, word_list), axis = 1))
But I get an error
TypeError: expected string or bytes-like object
Desired output
My desired output would be a new column, Fuzzy_Match, holding the output of the fuzzy function for each row:
   ID  Text                                               Fuzzy_Match
0  1   This num dogs and cats is (111)888-8780 and other  output of fuzzy 1
1  2   dont block cow 23 here                             output of fuzzy 2
2  3   cat two num: dog and cows here                     output of fuzzy 3
Question
What do I need to do to get my desired output?
This should work:
def fuzzy(row, word_list):
    tweet = row['Text']  # select the text column by name; row[0] is the integer ID
    fuzzy_match = []
    # Tokenize once per row rather than once per word
    token_words = nltk.word_tokenize(tweet)
    for word in word_list:
        # Score this word against every token; results come back best-first
        fuzzy_fx = process.extract(word, token_words, limit=100, scorer=fuzz.ratio)
        fuzzy_match.append(fuzzy_fx[0])
    return pd.Series([fuzzy_match], index=['Fuzzy_Match'])

df_fuzz = df.join(df.apply(lambda x: fuzzy(x, word_list), axis=1))
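process.extract() expects the query string as its first argument and a list of choices as its second. The original code passed it a single token (token_words[token]) and indexed word_list with a string (word_list[word]). The TypeError itself comes from tweet = row[0], which selects the integer ID column instead of Text, so nltk.word_tokenize receives an int rather than a string. You can read more about process.extract() in this related question:
python fuzzywuzzy's process.extract(): how does it work?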