I work on text mining problem and need to extract all mentioned of certain keywords. For example, given the list:
list_of_keywords = ['citalopram', 'trazodone', 'aspirin']
I need to find all occurrences of the keywords in a text. That could be easily done with Pandas (assuming my text is read in from a csv file):
import pandas as pd
df_text = pd.read_csv('text.csv')
df_text['matches'] = df_text.str.findall('|'.join(list_of_keywords))
However, there are spelling mistakes in the text and some times my keywords will be written as:
'citalopram' as 'cetalopram'
or
'trazodone' as 'trazadon'
Searching on the web, I found some suggestions how to implement the spell checker, but I am not sure where to insert the spell checker and I reckon that it may slow down the search in the case a very large text.
As another option, it has been suggested to use a wild card with regex and insert in the potential locations of confusion (conceptually written)
.findall('c*t*l*pr*m')
However I am not convinced that I can capture all possible problematic cases. I tried some out-of-the-box spell checkers, but my texts are some-what specific and I need a spell checker that 'knows' my domain (medical domain).
QUESTION
Is there any efficient way to extract keywords from a text including spelling mistakes?
You are right, you cannot capture all possible misspellings with regular expressions.
You do however have options.
You could
k
or less.To expand on what I mean by Edit Distance
(Images/Code borrowed from slides linked above, slides are free to use for anyone i.e no license)
Let us examine a simpler concept of Hamming Distance
def hammingDistance(x, y):
assert len(x) == len(y)
nmm = 0
for i in xrange(0, len(x)):
if x[i] != y[i]:
nmm += 1
return nmm
Hamming distance returns the number of characters that must swapped between 2 equal length strings to make them equal.
But what happens when the strings are not equal length?
Use editDistance Which is the num of characters that must be swapped/inserted/deleted on 2 strings to make them equal
Hamming distance becomes the base case for a recursive algorithm now
def edDistRecursive(x, y):
if len(x) == 0: return len(y)
if len(y) == 0: return len(x)
delt = 1 if x[-1] != y[-1] else 0
diag = edDistRecursive(x[:-1], y[:-1]) + delt
vert = edDistRecursive(x[:-1], y) + 1
horz = edDistRecursive(x, y[:-1]) + 1
return min(diag, vert, horz)
Just call the func above on what you think the word would/should match to (maybe by first looking up a trie). You can even memoize the soln to make it faster as there is a high probability for overlap