I need to build a NER
system (Named Entity Recognition). For simplicity, I am doing it by using approximate string matching as input can contain typos and other minor modifications. I have come across some great libraries like: fuzzywuzzy or even faster RapidFuzz. But unfortunately I didn't find a way to return the position where the match occurs. As, for my purpose I not only need to find the match, but also I need to know where the match happened. As for NER
, I need to replace those matches with some predefined string.
For example, If any one of the line is found in input string I want to replace them with the string COMPANY_NAME
:
google
microsoft
facebook
International Business Machine
Like, input: S/he works at Google
will be transformed to S/he works at COMPANY_NAME
.
You can safely assume that, all the input and the pattern to match are already preprocessed and most importantly they are in lower-case now. So, there is no problem with case-sensitivity.
Currently, I have approached with a sliding window technique. And a sliding window is passed over the input string from left to right and this window has exactly the size of the pattern we want to match. For example, when I want to match with International Business Machine
, I run a sliding window of size 3
from left to right and try to find the best match by observing each 3
consecutive tokens at the same time with a stride of 1
. I do believe, it is not the best way to do it, also it cannot find the best match.
So, what is the efficient way to find the best possible match along with the quantification on the found match (how much they are similar) and the position of the match(es), such that we can replace them with a given fixed string (if the calculated similarity is not less than a threshold)? Obviously, a single input may contain multiple portions to be replaced, each of them will be replaced separately, like: Google and Microsoft are big companies
will become COMPANY_NAME and COMPANY_NAME are big companies
etc.
EDIT: fixed link to RapidFuzz
It seems modules fuzzywuzzy
and RapidFuzz
don't have function for this. You could try to use process.extract()
or process.extractOne()
but it would need to split text in smaller parts (ie. words) and check every part separatelly. For longer words like International Business Machine
it would need to split in part with 3 words - so it would need even more work.
I think you need rather module fuzzysearch
import fuzzysearch
words = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Google and Microsoft are big companies like International Business Machine'
print(' text:', text)
print('---')
for word in sorted(words, key=len, reverse=True):
print(' word:', word)
results = fuzzysearch.find_near_matches(word, text, max_l_dist=1)
print('found:', results)
for item in reversed(results):
text = text[:item.start] + 'COMPANY' + text[item.end:]
print(' text:', text)
print('---')
Result:
text: Google and Microsoft are big companies like facebook International Business Machine
---
word: International Business Machine
found: [Match(start=53, end=83, dist=0, matched='International Business Machine')]
text: Google and Microsoft are big companies like facebook COMPANY
---
word: microsoft
found: [Match(start=11, end=20, dist=1, matched='Microsoft')]
text: Google and COMPANY are big companies like facebook COMPANY
---
word: facebook
found: [Match(start=42, end=50, dist=0, matched='facebook')]
text: Google and COMPANY are big companies like COMPANY COMPANY
---
word: google
found: [Match(start=0, end=6, dist=1, matched='Google')]
text: COMPANY and COMPANY are big companies like COMPANY COMPANY
If it finds many results for one word then it is better to start replacing at last position to keep other words in the same place. And this is why I use reversed()
.
I would start also with the longest word/name so later it still can search shorter words like Business
. And this is why I use sorted(..., key=len, reverse=True)
But I'm not sure if it works exactly as you want. Maybe it will have problem when words are more incorrect.
EDIT:
I tried to use fuzzywuzzy
for this and created this version but only for names with single word. For International Business Machine
it would need some other idea.
It split full text into words and compare words. Later replace word wich have ration > 80
words = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Google and Microsoft are big companies like International Business Machine'
# ---
import fuzzywuzzy.fuzz as fuzz
#import fuzzywuzzy.process
new_words = []
for part in text.split():
matches = []
for word in words:
result = fuzz.token_sort_ratio(part, word)
matches.append([result, part, word])
#print([result, part, word])
matches = sorted(matches, reverse=True)
if matches and matches[0][0] > 80:
new_words.append('COMPANY')
else:
new_words.append(matches[0][1])
print(" ".join(new_words))
Result:
[100, 'Google', 'google']
[27, 'Google', 'microsoft']
[29, 'Google', 'facebook']
[17, 'Google', 'International Business Machine']
[0, 'and', 'google']
[0, 'and', 'microsoft']
[18, 'and', 'facebook']
[12, 'and', 'International Business Machine']
[27, 'Microsoft', 'google']
[100, 'Microsoft', 'microsoft']
[35, 'Microsoft', 'facebook']
[15, 'Microsoft', 'International Business Machine']
[22, 'are', 'google']
[17, 'are', 'microsoft']
[36, 'are', 'facebook']
[12, 'are', 'International Business Machine']
[22, 'big', 'google']
[17, 'big', 'microsoft']
[18, 'big', 'facebook']
[12, 'big', 'International Business Machine']
[27, 'companies', 'google']
[33, 'companies', 'microsoft']
[24, 'companies', 'facebook']
[26, 'companies', 'International Business Machine']
[40, 'like', 'google']
[15, 'like', 'microsoft']
[17, 'like', 'facebook']
[18, 'like', 'International Business Machine']
[21, 'International', 'google']
[27, 'International', 'microsoft']
[19, 'International', 'facebook']
[60, 'International', 'International Business Machine']
[14, 'Business', 'google']
[24, 'Business', 'microsoft']
[12, 'Business', 'facebook']
[42, 'Business', 'International Business Machine']
[15, 'Machine', 'google']
[25, 'Machine', 'microsoft']
[40, 'Machine', 'facebook']
[38, 'Machine', 'International Business Machine']
COMPANY and COMPANY are big companies like International Business Machine
EDIT:
Second version which check also names with many words
all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Google and Microsoft are big companies like International Business Machine'
# ---
import fuzzywuzzy.fuzz as fuzz
for name in all_names:
length = len(name.split(' ')) # how many words has name
print('name length:', length, '|', name)
words = text.split() # split text into words
# compare name with all words in text
matches = []
for index in range(0, len(words)-length+1):
# join words if name has more then 1 word
part = " ".join(words[index:index+length])
#print('part:', part)
result = fuzz.token_sort_ratio(part, name)
matches.append([result, name, part, [index, index+length]])
print([result, name, part, [index, index+length]])
# reverse to start at last position
matches = list(reversed(matches))
max_match = max(x[0] for x in matches)
print('max match:', max_match)
# replace
if max_match > 80:
for match in matches:
if match[0] == max_match:
idx = match[3]
words = words[:idx[0]] + ['COMPANY'] + words[idx[1]:]
text = " ".join(words)
print('text:', text)
print('---')
Result:
ame length: 1 | google
[100, 'google', 'Google', [0, 1]]
[0, 'google', 'and', [1, 2]]
[27, 'google', 'Microsoft', [2, 3]]
[22, 'google', 'are', [3, 4]]
[22, 'google', 'big', [4, 5]]
[27, 'google', 'companies', [5, 6]]
[40, 'google', 'like', [6, 7]]
[21, 'google', 'International', [7, 8]]
[14, 'google', 'Business', [8, 9]]
[15, 'google', 'Machine', [9, 10]]
max match: 100
text: COMPANY and Microsoft are big companies like International Business Machine
---
name length: 1 | microsoft
[25, 'microsoft', 'COMPANY', [0, 1]]
[0, 'microsoft', 'and', [1, 2]]
[100, 'microsoft', 'Microsoft', [2, 3]]
[17, 'microsoft', 'are', [3, 4]]
[17, 'microsoft', 'big', [4, 5]]
[33, 'microsoft', 'companies', [5, 6]]
[15, 'microsoft', 'like', [6, 7]]
[27, 'microsoft', 'International', [7, 8]]
[24, 'microsoft', 'Business', [8, 9]]
[25, 'microsoft', 'Machine', [9, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 1 | facebook
[27, 'facebook', 'COMPANY', [0, 1]]
[18, 'facebook', 'and', [1, 2]]
[27, 'facebook', 'COMPANY', [2, 3]]
[36, 'facebook', 'are', [3, 4]]
[18, 'facebook', 'big', [4, 5]]
[24, 'facebook', 'companies', [5, 6]]
[17, 'facebook', 'like', [6, 7]]
[19, 'facebook', 'International', [7, 8]]
[12, 'facebook', 'Business', [8, 9]]
[40, 'facebook', 'Machine', [9, 10]]
max match: 40
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 3 | International Business Machine
[33, 'International Business Machine', 'COMPANY and COMPANY', [0, 3]]
[31, 'International Business Machine', 'and COMPANY are', [1, 4]]
[31, 'International Business Machine', 'COMPANY are big', [2, 5]]
[34, 'International Business Machine', 'are big companies', [3, 6]]
[38, 'International Business Machine', 'big companies like', [4, 7]]
[69, 'International Business Machine', 'companies like International', [5, 8]]
[88, 'International Business Machine', 'like International Business', [6, 9]]
[100, 'International Business Machine', 'International Business Machine', [7, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like COMPANY
EDIT:
Version with fuzzywuzzy.process
This time I don't have positions and I simply use standard text.replace(item[0], 'COMPANY')
.
I think in most situations it will work correctly and it doesn't need better method.
This time I check it on text with mistakes:
'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'
all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']
text = 'Google and Microsoft are big companies like Facebook and International Business Machine'
# text with mistakes
text = 'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'
# ---
import fuzzywuzzy.process
#import fuzzywuzzy.fuzz
for name in sorted(all_names, key=len, reverse=True):
lenght = len(name.split())
words = text.split()
words = [" ".join(words[i:i+lenght]) for i in range(0, len(words)-lenght+1)]
#print(words)
#result = fuzzywuzzy.process.extractBests(name, words, scorer=fuzzywuzzy.fuzz.token_sort_ratio, score_cutoff=80)
result = fuzzywuzzy.process.extractBests(name, words, score_cutoff=80)
print(name, result)
for item in result:
text = text.replace(item[0], 'COMPANY')
print(text)