I am wondering how to generate a set of similar strings based on Levenshtein distance (string edit distance). Ideally, I would like to pass in three parameters: a source string (i.e. the string used to generate other strings that are similar to it), the number of strings to generate, and a threshold, such that the similarity among the strings in the generated set is greater than the threshold. What Python package(s) should I use to achieve that? Or any ideas on how to implement this?
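I think you can approach the problem the other way around (reversed): start from the source string and apply up to k edits to it, where k is the maximum edit distance you allow. For example, let's say k = 2. And assume the allowed edit modes you have are delete, add, and substitute (each applied to a single character).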
Then the logic is something like below:
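import random
import string

source = 'sittin'
n, k = 5, 2  # example values: n strings to generate, each at most k edits from source
generated = []
for _ in range(n):
    s = source
    # apply between 1 and k random edits, so the edit distance is at most k
    for _ in range(random.randint(1, k)):
        mode = random.choice(['delete', 'add', 'substitute']) if s else 'add'
        i = random.randrange(len(s) + (mode == 'add'))  # position of the edit
        c = '' if mode == 'delete' else random.choice(string.ascii_lowercase)
        s = s[:i] + c + s[i + (mode != 'add'):]  # do it, and update s
    generated.append(s)
print(generated)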
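One caveat: random edits can cancel each other out (an add followed by a delete of the same character, or a substitute that happens to pick the same letter), so the loop above bounds the distance from above but does not guarantee that it is at least one. If that matters, you can measure the actual distance and filter. Below is a minimal sketch using the standard dynamic-programming formulation; the function name `levenshtein` is just my choice for illustration, not something from a particular package.

```python
def levenshtein(a, b):
    """Return the Levenshtein distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # add cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]
```

You could then keep a generated candidate only if `1 <= levenshtein(source, candidate) <= k`, and keep drawing until you have collected n distinct strings.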
If you need to stick with a pre-defined dictionary, that adds some complexity but it is still doable. In this case, the edit must be valid.
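For the dictionary case, here is one possible sketch. It assumes "valid" means that the final string must be a dictionary word, while intermediate strings need not be. Instead of editing at random, it enumerates every string within k edits of the source and keeps only those found in the dictionary. The names `edits1` and `dictionary_neighbors`, the lowercase alphabet, and the word list in the usage note are all assumptions for illustration.

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """Return every string exactly one delete, add, or substitute away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    adds = {left + c + right for left, right in splits for c in alphabet}
    substitutes = {left + c + right[1:]
                   for left, right in splits if right
                   for c in alphabet}
    # substituting a character with itself reproduces word, so drop it
    return (deletes | adds | substitutes) - {word}

def dictionary_neighbors(source, k, dictionary):
    """Return the dictionary words that are between 1 and k edits from source."""
    frontier, seen = {source}, {source}
    for _ in range(k):
        # expand one more edit outward, skipping strings already reached
        frontier = {e for w in frontier for e in edits1(w)} - seen
        seen |= frontier
    return (seen - {source}) & set(dictionary)
```

For example, `dictionary_neighbors('sittin', 2, ['sitting', 'sitter', 'fitting', 'kitten', 'sit'])` keeps everything except `'sit'`, which is three deletions away. You can then sample n strings from the result. Note that the number of candidates grows quickly with k and with the length of the source string, so this is only practical for small k.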