python string python-3.x machine-learning string-matching

How to generate a set of similar strings in Python

I am wondering how to generate a set of similar strings based on Levenshtein distance (string edit distance). Ideally, I would like to pass in three parameters: a source string (i.e. the string used to generate other strings that are similar to it), the number of strings to generate, and a threshold, i.e. the similarity among the strings in the generated set should be greater than the threshold. Which Python package(s) should I use to achieve that? Or does anyone have an idea of how to implement this?

Solution

I think you can approach the problem the other way around: instead of searching for strings that happen to be similar, build them by editing the source string.

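Given a string, say 'sittin'.
Given a threshold (edit distance), say k.
Then apply combinations of different "edits" in at most k steps. A string produced this way is at most k edits away from the source, because some edits can cancel each other out.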

For example, let's say k = 2, and assume the allowed edit modes are:

delete one character
add one character
substitute one character with another one.
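
The three edit modes above can be sketched as small string functions. This is only an illustration: the function names and the explicit `pos` and `char` parameters are assumptions made for this example, not part of any library.

```python
def delete_char(s, pos):
    """Remove the character at index pos."""
    return s[:pos] + s[pos + 1:]


def add_char(s, pos, char):
    """Insert char so that it ends up at index pos."""
    return s[:pos] + char + s[pos:]


def substitute_char(s, pos, char):
    """Replace the character at index pos with char."""
    return s[:pos] + char + s[pos + 1:]


print(delete_char('sittin', 0))           # ittin
print(add_char('sittin', 6, 'g'))         # sitting
print(substitute_char('sittin', 0, 'k'))  # kittin
```

Each call changes the string by exactly one edit, so chaining k of them gives a string whose Levenshtein distance to the source is at most k.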

Then the logic is something like the following:

source = 'sittin'
for num in 1 ... n:  # suppose you want to have n strings generated
  candidate = source
  # the number of edits should be at least one
  # and at most k
  for i in 1 ... randint(1, k):
    pick a random edit mode from (delete, add, substitute)
    apply it at a random position and update candidate
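
A runnable version of this loop might look like the sketch below. It is not a definitive implementation: the names `random_edit`, `similar_strings`, `alphabet`, and `max_tries` are assumptions made for this example, the alphabet is assumed to be lowercase ASCII, and k is assumed to be at least 1. The `levenshtein` helper is a plain dynamic-programming implementation included only to check the results.

```python
import random
import string


def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def random_edit(s, alphabet):
    """Apply one random delete, add, or substitute edit to s."""
    # An empty string can only grow, so restrict the choice in that case.
    mode = random.choice(['delete', 'add', 'substitute'] if s else ['add'])
    char = random.choice(alphabet)
    if mode == 'add':
        pos = random.randint(0, len(s))  # inclusive, so appending is possible
        return s[:pos] + char + s[pos:]
    pos = random.randrange(len(s))
    if mode == 'delete':
        return s[:pos] + s[pos + 1:]
    return s[:pos] + char + s[pos + 1:]


def similar_strings(source, n, k, alphabet=string.ascii_lowercase,
                    max_tries=10_000):
    """Return up to n distinct strings s with 1 <= distance(source, s) <= k."""
    results = set()
    for _ in range(max_tries):
        if len(results) == n:
            break
        candidate = source
        for _ in range(random.randint(1, k)):  # between 1 and k edits
            candidate = random_edit(candidate, alphabet)
        # Edits can cancel out, so drop candidates equal to the source.
        if candidate != source:
            results.add(candidate)
    return results


for s in sorted(similar_strings('sittin', n=5, k=2)):
    print(s, levenshtein('sittin', s))
```

Using a set removes duplicates, and `max_tries` keeps the loop from running forever when fewer than n distinct strings exist. Note that the threshold here bounds the distance to the source; any two generated strings can still be up to 2k edits apart from each other.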

If you need to stick to a pre-defined dictionary, that adds some complexity but it is still doable. In this case, only edits that lead to a valid dictionary entry are allowed.
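
One possible way to handle the dictionary case is to enumerate every string within k edits of the source and keep only those found in the dictionary. The sketch below assumes that only the final string has to be a dictionary entry (intermediate strings may be invalid), that words use lowercase ASCII letters, and that k is small, since the number of candidates grows quickly with k. The names `edits1` and `dictionary_neighbors` and the toy word set are hypothetical.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings exactly one delete, add, or substitute away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    adds = {left + c + right for left, right in splits for c in alphabet}
    substitutes = {left + c + right[1:]
                   for left, right in splits if right
                   for c in alphabet}
    return (deletes | adds | substitutes) - {word}


def dictionary_neighbors(source, k, dictionary):
    """Dictionary words reachable from source in at most k edits."""
    frontier = {source}
    seen = {source}
    for _ in range(k):
        # Expand one more edit step, skipping strings already visited.
        frontier = {e for w in frontier for e in edits1(w)} - seen
        seen |= frontier
    return (seen - {source}) & set(dictionary)


# Hypothetical toy dictionary, for illustration only.
words = {'sitting', 'sitter', 'kitten', 'bitten', 'sit', 'mitten'}
print(sorted(dictionary_neighbors('sittin', 2, words)))
# ['bitten', 'kitten', 'mitten', 'sitter', 'sitting']
```

Here 'sit' is left out because it is three deletions away from 'sittin'. To get n random strings, you could sample from the returned set, for example with `random.sample`.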