Generating similar words for OCR

so first of this is my first time asking a question here so forgive me if I make any mistakes. My Problem is as follows: I'm using python to sort through a bunch of images. The images are sorted by many criteria, one of which is the text inside the Image. I've got OCR working and have a list of "bad" words which arent supposed to be in the Image. The problem is that the OCR often confuses some letters, for example e and a. The question is if there is an easy way to generate similar looking words. Like create_similar("test") And output would be ["test", "tast" "lest"] and so on. So I could use that as the list of Bad words and avoid false negatives. If I'm just missing a really obvious solution, please tell me. I've been trying for hours now and just can't get it to work.

Solution

I really recommend this article by Peter Norvig on how to build a spelling corrector. In it, you will find the following function that returns a set of all the edited strings (whether words or not) that can be made with one simple edit. A simple edit to a word is a deletion (remove one letter), a transposition (swap two adjacent letters), a replacement (change one letter to another) or an insertion (add a letter).

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

For your use case, you probably are not interested in deletes, transposes and inserts, so you could simplify it to:

def create_similar(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    replaces   = {L + c + R[1:]           for L, R in splits if R for c in letters}
    replaces.remove(word)
    return replaces

Result for:

create_similar("test")

is:

{'aest',
 'best',
 'cest',
 'dest',
 'eest',
 'fest',
 'gest',
 'hest',
 'iest',
 'jest',
 'kest',
 'lest',
 'mest',
 'nest',
 'oest',
 'pest',
 'qest',
 'rest',
 'sest',
 'tast',
 'tbst',
 'tcst',
 'tdst',
 'teat',
 'tebt',
 'tect',
 'tedt',
 'teet',
 'teft',
 'tegt',
 'teht',
 'teit',
 'tejt',
 'tekt',
 'telt',
 'temt',
 'tent',
 'teot',
 'tept',
 'teqt',
 'tert',
 'tesa',
 'tesb',
 'tesc',
 'tesd',
 'tese',
 'tesf',
 'tesg',
 'tesh',
 'tesi',
 'tesj',
 'tesk',
 'tesl',
 'tesm',
 'tesn',
 'teso',
 'tesp',
 'tesq',
 'tesr',
 'tess',
 'tesu',
 'tesv',
 'tesw',
 'tesx',
 'tesy',
 'tesz',
 'tett',
 'teut',
 'tevt',
 'tewt',
 'text',
 'teyt',
 'tezt',
 'tfst',
 'tgst',
 'thst',
 'tist',
 'tjst',
 'tkst',
 'tlst',
 'tmst',
 'tnst',
 'tost',
 'tpst',
 'tqst',
 'trst',
 'tsst',
 'ttst',
 'tust',
 'tvst',
 'twst',
 'txst',
 'tyst',
 'tzst',
 'uest',
 'vest',
 'west',
 'xest',
 'yest',
 'zest'}