Tags: python, nlp, nltk

How to find similar sounding words?


I'm writing a specialized multi-lingual search engine in the food domain.
I use Python and the NLTK library, and I have quite a big database of recipes for all the cultures I want to support.

I'm asking if, and how, it is possible to find a misspelled word in my indexed corpus of words.
For example, in Italian, when searching for the word "couscous", many users would say/write "cus cus" or "cuscus"...
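As a minimal illustration of the kind of matching I'm after, even the standard library's difflib.get_close_matches already catches these variants by character-level similarity (the vocabulary below is just a made-up example, not my real index):

```python
import difflib

# Tiny made-up vocabulary standing in for the indexed corpus
vocabulary = ["couscous", "pasta", "pizza", "risotto"]

# get_close_matches ranks candidates by SequenceMatcher ratio;
# cutoff=0.6 is the default similarity threshold.
print(difflib.get_close_matches("cuscus", vocabulary))   # ['couscous']
print(difflib.get_close_matches("cus cus", vocabulary))  # ['couscous']
```

This is character-based rather than phonetic, and it gets slow on a large vocabulary, but it shows the behaviour I'd like the index to support.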

In summary, this is an example of how I tokenize my index of lexemes for search:

import nltk
import string

def tokenize(text, corpus='italian'):
    stemmer = nltk.stem.snowball.ItalianStemmer()
    stopWords = nltk.corpus.stopwords.words(corpus)

    # tokenize the sentence(s)
    wordTokenizedList = nltk.tokenize.word_tokenize(text)

    # remove punctuation and lower-case everything
    wordTokenizedListNoPunct = [ word.lower() for word in wordTokenizedList if word not in string.punctuation ]

    # remove stop words
    wordTokenizedListNoPunctNoStopWords = [ word for word in wordTokenizedListNoPunct if word not in stopWords ]

    # stem with the Snowball stemmer
    wordTokenizedListNoPunctNoStopWordsStems = [ stemmer.stem(i) for i in wordTokenizedListNoPunctNoStopWords ]

    return wordTokenizedListNoPunctNoStopWordsStems

Should I prepare my index differently to reach my goal?

Any additional remark about a more complete text-analysis flow for tokenization would be welcome, of course... :-)
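One direction I've been considering, since the problem is literally about similar-sounding words, is storing a phonetic key next to each stem. Below is a rough pure-Python Soundex sketch just to illustrate the idea (Soundex is designed for English, so a real Italian index would likely need a different phonetic scheme):

```python
def soundex(word: str) -> str:
    """Classic 4-character Soundex code: first letter + up to 3 digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    # keep only letters, so "cus cus" and "cuscus" normalize identically
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    first = word[0].upper()
    digits = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h/w do not break a run of equal codes
        code = codes.get(ch, "")  # vowels map to "" and reset the run
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]

print(soundex("couscous"))  # C220
print(soundex("cus cus"))   # C220 -- same key, so they match in the index
```

All three spellings ("couscous", "cuscus", "cus cus") collapse to the same key, so a lookup table from phonetic key to canonical words would resolve these misspellings.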


Solution

  • Well, I'd just use a database with built-in full-text search capabilities.

    PROS:

    1. They already solve these kinds of issues
    2. WAY faster
    3. Safer

    And well, a long etcetera, as you can imagine.

    It's really easy to connect Python with SQLite, and the FTS5 (Full-Text Search) module works great!

    I'd highly recommend watching the following video to get an idea of whether this will suit your solution :) video
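To make the SQLite suggestion concrete, here is a minimal sketch using only the standard-library sqlite3 module (the table name and sample rows are made up; FTS5 availability depends on how the bundled SQLite library was compiled, though most modern builds include it):

```python
import sqlite3

# In-memory database just for the demonstration
conn = sqlite3.connect(":memory:")

# FTS5 virtual table: every column is full-text indexed
conn.execute("CREATE VIRTUAL TABLE recipes USING fts5(title, ingredients)")
conn.executemany(
    "INSERT INTO recipes VALUES (?, ?)",
    [("Couscous alle verdure", "couscous zucchine carote"),
     ("Pasta al pomodoro", "pasta pomodoro basilico")],
)

# Prefix queries ('cous*') catch some spelling variants cheaply
rows = conn.execute(
    "SELECT title FROM recipes WHERE recipes MATCH ?", ("cous*",)
).fetchall()
print(rows)  # [('Couscous alle verdure',)]
```

Note that FTS5 by itself does token and prefix matching, not phonetic or fuzzy matching, so a variant like "cuscus" would still need a normalization or spell-correction layer in front of the query.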