Search code examples
pythonstringocr

Generating similar words for OCR


so first of this is my first time asking a question here so forgive me if I make any mistakes. My Problem is as follows: I'm using python to sort through a bunch of images. The images are sorted by many criteria, one of which is the text inside the Image. I've got OCR working and have a list of "bad" words which arent supposed to be in the Image. The problem is that the OCR often confuses some letters, for example e and a. The question is if there is an easy way to generate similar looking words. Like create_similar("test") And output would be ["test", "tast" "lest"] and so on. So I could use that as the list of Bad words and avoid false negatives. If I'm just missing a really obvious solution, please tell me. I've been trying for hours now and just can't get it to work.


Solution

  • I really recommend this article by Peter Norvig on how to build a spelling corrector. In it, you will find the following function that returns a set of all the edited strings (whether words or not) that can be made with one simple edit. A simple edit to a word is a deletion (remove one letter), a transposition (swap two adjacent letters), a replacement (change one letter to another) or an insertion (add a letter).

    def edits1(word):
        "All edits that are one edit away from `word`."
        letters    = 'abcdefghijklmnopqrstuvwxyz'
        splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
        deletes    = [L + R[1:]               for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
        replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
        inserts    = [L + c + R               for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)
    

    For your use case, you probably are not interested in deletes, transposes and inserts, so you could simplify it to:

    def create_similar(word):
        "All edits that are one edit away from `word`."
        letters    = 'abcdefghijklmnopqrstuvwxyz'
        splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
        replaces   = {L + c + R[1:]           for L, R in splits if R for c in letters}
        replaces.remove(word)
        return replaces
    

    Result for:

    create_similar("test")
    

    is:

    {'aest',
     'best',
     'cest',
     'dest',
     'eest',
     'fest',
     'gest',
     'hest',
     'iest',
     'jest',
     'kest',
     'lest',
     'mest',
     'nest',
     'oest',
     'pest',
     'qest',
     'rest',
     'sest',
     'tast',
     'tbst',
     'tcst',
     'tdst',
     'teat',
     'tebt',
     'tect',
     'tedt',
     'teet',
     'teft',
     'tegt',
     'teht',
     'teit',
     'tejt',
     'tekt',
     'telt',
     'temt',
     'tent',
     'teot',
     'tept',
     'teqt',
     'tert',
     'tesa',
     'tesb',
     'tesc',
     'tesd',
     'tese',
     'tesf',
     'tesg',
     'tesh',
     'tesi',
     'tesj',
     'tesk',
     'tesl',
     'tesm',
     'tesn',
     'teso',
     'tesp',
     'tesq',
     'tesr',
     'tess',
     'tesu',
     'tesv',
     'tesw',
     'tesx',
     'tesy',
     'tesz',
     'tett',
     'teut',
     'tevt',
     'tewt',
     'text',
     'teyt',
     'tezt',
     'tfst',
     'tgst',
     'thst',
     'tist',
     'tjst',
     'tkst',
     'tlst',
     'tmst',
     'tnst',
     'tost',
     'tpst',
     'tqst',
     'trst',
     'tsst',
     'ttst',
     'tust',
     'tvst',
     'twst',
     'txst',
     'tyst',
     'tzst',
     'uest',
     'vest',
     'west',
     'xest',
     'yest',
     'zest'}