Tags: python, pandas, nlp, spell-checking

How to implement a fast spellchecker in Python with Pandas?


I am working on a text mining problem and need to extract all mentions of certain keywords. For example, given the list:

list_of_keywords = ['citalopram', 'trazodone', 'aspirin']

I need to find all occurrences of the keywords in a text. That can easily be done with Pandas (assuming my text is read in from a CSV file):

import pandas as pd

df_text = pd.read_csv('text.csv')
# assuming the text lives in a column named 'text'
df_text['matches'] = df_text['text'].str.findall('|'.join(list_of_keywords))

However, there are spelling mistakes in the text, and sometimes my keywords will be written as:

'citalopram' as 'cetalopram'

or

'trazodone' as 'trazadon'

Searching on the web, I found some suggestions on how to implement a spell checker, but I am not sure where to insert it, and I reckon that it may slow down the search in the case of a very large text.

As another option, it has been suggested to use wildcards with regex, inserted at the potential locations of confusion (conceptually written):

.findall('c*t*l*pr*m')

However, I am not convinced that I can capture all possible problematic cases. I tried some out-of-the-box spell checkers, but my texts are somewhat specific and I need a spell checker that 'knows' my domain (the medical domain).

QUESTION

Is there any efficient way to extract keywords from a text including spelling mistakes?


Solution

  • You are right: you cannot capture all possible misspellings with regular expressions.

    You do, however, have options.

    You could

    • Use a trie. A lot of autocomplete and spell-checking solutions use tries. However, most of them operate on a word-by-word basis, not on the text as a whole.
    • That being said, what you really want is a fuzzy text extractor, since you just want to match alternate/slightly wrong spellings rather than correct those spellings in the text. So you have more options here as well (see the sketch just after this list):
      • Computational genomics has this challenge: searching for patterns of base pairs in long sequences while allowing a certain amount of mismatch in the matched text. So an approximate matching solution like the ones outlined here will help. Those slides make excellent use of the pigeonhole principle to do what you need, and the code is open source too!
      • If you want something a lot less complex, just run an edit-distance filter on all the words of the document and admit only words with an edit distance of k or less from one of your keywords.
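
    As a concrete illustration of the fuzzy-extractor idea, here is a minimal sketch using the third-party regex module (a drop-in replacement for re), which supports approximate matching. The tolerance of 2 edits per keyword and the example sentence are assumptions for illustration, not part of the question:

    import regex  # third-party module: pip install regex (not the built-in re)

    list_of_keywords = ['citalopram', 'trazodone', 'aspirin']

    # allow up to 2 edits (insertions/deletions/substitutions) per keyword
    pattern = '|'.join('(?:%s){e<=2}' % kw for kw in list_of_keywords)

    text = 'patient started on cetalopram and trazadon'
    print(regex.findall(pattern, text, regex.BESTMATCH))
    # expected to return something like ['cetalopram', 'trazadon']

    Too generous an edit budget will start matching unrelated words, so pick the tolerance per keyword with care.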

    To expand on what I mean by edit distance:

    (Code borrowed from the slides linked above; the slides are free for anyone to use, i.e. no license.)

    Let us first examine the simpler concept of Hamming distance:

    def hammingDistance(x, y):
        # number of positions at which two equal-length strings differ
        assert len(x) == len(y)
        nmm = 0
        for i in range(len(x)):
            if x[i] != y[i]:
                nmm += 1
        return nmm
    

    Hamming distance returns the number of characters that must be substituted between two equal-length strings to make them equal.
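
    For example, with the misspelled keyword from the question (both strings are 10 characters long):

    print(hammingDistance('citalopram', 'cetalopram'))  # -> 1, only the second character differs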

    But what happens when the strings are not of equal length?

    Use edit distance, which is the number of characters that must be substituted/inserted/deleted across two strings to make them equal.

    Hamming distance now becomes the base case for a recursive algorithm:

    def edDistRecursive(x, y):
        # base cases: the distance to an empty string is the other string's length
        if len(x) == 0: return len(y)
        if len(y) == 0: return len(x)
        delt = 1 if x[-1] != y[-1] else 0                # substitution cost for the last characters
        diag = edDistRecursive(x[:-1], y[:-1]) + delt    # substitute (or match) the last characters
        vert = edDistRecursive(x[:-1], y) + 1            # delete the last character of x
        horz = edDistRecursive(x, y[:-1]) + 1            # insert the last character of y
        return min(diag, vert, horz)
    

    Just call the function above on what you think the word would/should match to (perhaps by first looking it up in a trie). You can even memoize the solution to make it faster, as there is a high probability of overlap between subproblems.
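
    Putting those pieces together, here is a minimal sketch of the edit-distance filter with memoization via functools.lru_cache. The column name 'text', the helper name fuzzy_matches, and the tolerance k=2 are assumptions for illustration, not requirements:

    from functools import lru_cache

    import pandas as pd

    list_of_keywords = ['citalopram', 'trazodone', 'aspirin']

    @lru_cache(maxsize=None)                 # memoize overlapping subproblems
    def ed_dist(x, y):
        if len(x) == 0: return len(y)
        if len(y) == 0: return len(x)
        delt = 1 if x[-1] != y[-1] else 0
        return min(ed_dist(x[:-1], y[:-1]) + delt,   # substitute / match
                   ed_dist(x[:-1], y) + 1,           # delete
                   ed_dist(x, y[:-1]) + 1)           # insert

    def fuzzy_matches(text, k=2):
        # admit a word if it is within k edits of any keyword
        return [w for w in str(text).lower().split()
                if any(ed_dist(w, kw) <= k for kw in list_of_keywords)]

    df_text = pd.read_csv('text.csv')                         # assumes a 'text' column
    df_text['matches'] = df_text['text'].apply(fuzzy_matches)

    This only looks at whitespace-separated, lowercased words, so punctuation stuck to a word will count against the edit budget; strip it first if that matters for your data.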