First off, this is my first time asking a question here, so forgive me if I make any mistakes.
My problem is as follows: I'm using Python to sort through a bunch of images. The images are sorted by many criteria, one of which is the text inside the image. I've got OCR working and have a list of "bad" words which aren't supposed to be in the image. The problem is that the OCR often confuses some letters, for example e and a. Is there an easy way to generate similar-looking words?
For example, create_similar("test")
would output ["test", "tast", "lest"]
and so on.
I could then use that as the list of bad words and avoid false negatives. If I'm just missing a really obvious solution, please tell me. I've been trying for hours now and just can't get it to work.
I really recommend this article by Peter Norvig on how to build a spelling corrector. In it, you will find the following function that returns a set of all the edited strings (whether words or not) that can be made with one simple edit. A simple edit to a word is a deletion (remove one letter), a transposition (swap two adjacent letters), a replacement (change one letter to another) or an insertion (add a letter).
def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
For your use case, you probably are not interested in deletes, transposes and inserts, so you could simplify it to:
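def create_similar(word):
    "All strings that differ from `word` by exactly one replaced letter."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    replaces = {L + c + R[1:] for L, R in splits if R for c in letters}
    # discard() instead of remove(): no KeyError if `word` is empty
    # or contains no lowercase letters and so never appears in the set
    replaces.discard(word)
    return replaces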
Result for:
create_similar("test")
is:
{'aest',
'best',
'cest',
'dest',
'eest',
'fest',
'gest',
'hest',
'iest',
'jest',
'kest',
'lest',
'mest',
'nest',
'oest',
'pest',
'qest',
'rest',
'sest',
'tast',
'tbst',
'tcst',
'tdst',
'teat',
'tebt',
'tect',
'tedt',
'teet',
'teft',
'tegt',
'teht',
'teit',
'tejt',
'tekt',
'telt',
'temt',
'tent',
'teot',
'tept',
'teqt',
'tert',
'tesa',
'tesb',
'tesc',
'tesd',
'tese',
'tesf',
'tesg',
'tesh',
'tesi',
'tesj',
'tesk',
'tesl',
'tesm',
'tesn',
'teso',
'tesp',
'tesq',
'tesr',
'tess',
'tesu',
'tesv',
'tesw',
'tesx',
'tesy',
'tesz',
'tett',
'teut',
'tevt',
'tewt',
'text',
'teyt',
'tezt',
'tfst',
'tgst',
'thst',
'tist',
'tjst',
'tkst',
'tlst',
'tmst',
'tnst',
'tost',
'tpst',
'tqst',
'trst',
'tsst',
'ttst',
'tust',
'tvst',
'twst',
'txst',
'tyst',
'tzst',
'uest',
'vest',
'west',
'xest',
'yest',
'zest'}
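Since OCR usually confuses only certain letter pairs, the candidate set can probably be narrowed by swapping each letter only for its look-alikes. Below is a minimal sketch under that assumption. The `CONFUSABLE` table and its pairs are illustrative guesses rather than measured OCR error data, and `create_confusable` and `contains_bad_word` are hypothetical helper names, not part of Norvig's article:

```python
# Hypothetical look-alike table: the pairs are illustrative guesses,
# not measured OCR errors. Adjust them to what your OCR engine confuses.
CONFUSABLE = {
    "a": "eo",
    "e": "ac",
    "t": "lf",
    "l": "ti1",
    "o": "a0",
    "i": "l1",
}

def create_confusable(word):
    """Return `word` plus every variant with one letter swapped for a look-alike."""
    variants = {word}
    for i, ch in enumerate(word):
        for c in CONFUSABLE.get(ch, ""):
            variants.add(word[:i] + c + word[i + 1:])
    return variants

def contains_bad_word(ocr_text, bad_words):
    """True if any whitespace-separated token is a bad word or one of its variants."""
    candidates = set()
    for bad in bad_words:
        candidates |= create_confusable(bad.lower())
    return any(token in candidates for token in ocr_text.lower().split())

print(sorted(create_confusable("test")))
print(contains_bad_word("this is a tast image", ["test"]))
```

This keeps the list of variants per word short, at the cost of missing confusions that are not in the table. Matching whole tokens is also a simplification: punctuation attached to a word would need stripping first.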