python, string, python-3.x, machine-learning, string-matching

How to generate a set of similar strings in Python


I am wondering how to generate a set of similar strings based on Levenshtein distance (string edit distance). Ideally, I would like to pass in three parameters: a source string (i.e. the string from which the similar strings are generated), the number of strings to generate, and a threshold, such that the similarities among the strings in the generated set are greater than the threshold. Which Python package(s) should I use to achieve this? Or any ideas on how to implement it?


Solution

  • I think you can approach the problem the other way around (in reverse): start from the source string and edit it.

    • Take a source string, say sittin.
    • Take a threshold on the edit distance, say k.
    • Then apply a combination of at most k "edits" to the source string.

    For example, let k = 2, and assume the allowed edit modes are:

    • delete one character
    • add one character
    • substitute one character with another one.

    The logic then looks something like this:

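    import random
    import string

    source = 'sittin'
    n, k = 5, 2  # example values: generate n strings, each at most k edits from source
    generated = []
    for _ in range(n):
        s = source
        # apply between 1 and k random edits, so s stays within edit distance k
        for _ in range(random.randint(1, k)):
            # an empty string can only be extended
            mode = random.choice(['delete', 'add', 'substitute']) if s else 'add'
            i = random.randrange(len(s) + 1) if mode == 'add' else random.randrange(len(s))
            c = '' if mode == 'delete' else random.choice(string.ascii_lowercase)  # assumes a lowercase alphabet
            s = s[:i] + c + (s[i:] if mode == 'add' else s[i + 1:])
        generated.append(s)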
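
    One caveat with the loop above is that random edits can cancel each other out (for example, substituting a character with itself), so a generated string may end up identical to the source, and the same string may be generated twice. A minimal sketch that wraps the idea in a function, removes duplicates, and checks the actual Levenshtein distance could look like this. The names levenshtein, random_edit and generate_similar, the lowercase ASCII alphabet, and the max_tries cap are all assumptions made for illustration; none of them come from an existing package.

```python
import random
import string


def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute), via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]


def random_edit(s, alphabet, rng):
    """Apply one random delete, add, or substitute to s."""
    # An empty string can only be extended.
    mode = rng.choice(["delete", "add", "substitute"]) if s else "add"
    if mode == "add":
        i = rng.randrange(len(s) + 1)
        return s[:i] + rng.choice(alphabet) + s[i:]
    i = rng.randrange(len(s))
    if mode == "delete":
        return s[:i] + s[i + 1:]
    return s[:i] + rng.choice(alphabet) + s[i + 1:]


def generate_similar(source, n, k, alphabet=string.ascii_lowercase,
                     seed=None, max_tries=10_000):
    """Return up to n distinct strings s with 1 <= levenshtein(source, s) <= k."""
    rng = random.Random(seed)
    found = set()
    for _ in range(max_tries):  # hypothetical cap so the loop always terminates
        if len(found) >= n:
            break
        s = source
        for _ in range(rng.randint(1, k)):
            s = random_edit(s, alphabet, rng)
        # The upper bound k holds by construction; the lower bound filters
        # out candidates whose edits cancelled and reproduced the source.
        if 1 <= levenshtein(source, s) <= k:
            found.add(s)
    return found


similar = generate_similar("sittin", n=5, k=2, seed=0)
```

    Passing a seed makes the output reproducible. If fewer than n distinct strings exist within distance k (for instance with a very small alphabet), the function returns whatever it found within max_tries attempts.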
    

    If the generated strings must come from a predefined dictionary, that adds some complexity, but it is still doable. In that case, each edit must be valid with respect to the dictionary.
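
    One possible reading of the dictionary case is that only the final string has to be a dictionary entry, while intermediate edits may be arbitrary. Under that assumption, a rough sketch is to enumerate every string reachable within k edits and keep those that appear in the dictionary. The function names edits1 and dictionary_neighbours and the toy word set below are hypothetical, chosen for illustration. Note that the number of candidates grows quickly with k and with the alphabet size, so this is only practical for small k.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one delete, add, or substitute away from word.

    Substituting a character with itself also yields word, so word may be included.
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    adds = {left + c + right for left, right in splits for c in alphabet}
    substitutes = {left + c + right[1:]
                   for left, right in splits if right for c in alphabet}
    return deletes | adds | substitutes


def dictionary_neighbours(source, k, dictionary):
    """Dictionary words within k edits of source, excluding source itself."""
    reached = {source}
    frontier = {source}
    for _ in range(k):
        # Expand by one more edit, keeping only strings not seen before.
        frontier = {e for w in frontier for e in edits1(w)} - reached
        reached |= frontier
    return (reached & set(dictionary)) - {source}


# Hypothetical toy dictionary, for illustration only.
words = {"sitting", "sittin", "sit", "kitten", "bitten", "mitten", "sitter", "fitting"}
print(sorted(dictionary_neighbours("sittin", 2, words)))
# → ['bitten', 'fitting', 'kitten', 'mitten', 'sitter', 'sitting']
```

    If every intermediate string also has to be a dictionary word, the same loop can be adapted by intersecting frontier with the dictionary after each step.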