Search code examples
pythonlevenshtein-distance

Write a Python method to generate typos based on a string


I could just add in something that creates typos based on Levenshtein distance of two, or something like that, or reverse-engineer Norvig's article on spellchecking.

However, what are the most common ways to typos?

Has somebody written a method?


Solution

  • There's no such thing as general typo generation algorithm because this kind of algorithm depends on the target language and application - ie to generate spam domains you basically need to apply following strategies (using meta.stackoverflow.com as an example):

    1. missing dots: met*as*tackoverflow.com (should be easy ;)
    2. character insertion: meta.stackoverflo*ww*.com (just add a dupe of every character)
    3. character omission: meta.stackoverf*lw*.com (just drop a character)
    4. character permutation: meta.stackove*fr*low.com (pure mathematics here)
    5. character replacement: meta.*d*tackoverflow.com (now here we can have at least two strategies, see below)

    In case of character replacement we can have at least two scenarios:

    1. Similar sounding letters (ie c <-> k, z <-> ts ) depending on language
    2. Nearby letters proximity typos (ie for qwerty s <-> d, d <-> f) Duh, I actually made a typo here with s <-> d case :)

    Hope this helps..