Search code examples
pythonalgorithmsuffix-tree

suffix trees: locating a substring if a certain number of mistakes are allowed


According to the Wikipedia article on suffix trees, suffix trees can be used to locate substrings of a string if a certain number of mistakes are allowed.

Given the suffix tree of a string, how can I find all the instances of a given substrings of it, allowing for each instance at most one mistake?

(By "mistake", I mean the substitution of one character.)


Solution

  • That would be just a more convoluted graph searching (aka find the path through the dungeon where some of the doors are broken and need to be kicked open and you want to spare your feet).

    The details depend greatly on what do you mean by "mistake". So I take it that "mistake" is a substitution of one character, that is the easiest case.

    In the algorithm, you will search the tree from the root comparing and advancing your pattern as if you'd searched for exact match. Just if there was a character on the edge that you can't follow, you save the state of your algorithm for later (the state being [tree position, pattern position]). This should apply even when you can follow one link for a node, but not another - you follow the matching and save the others.

    Then, you return to the saved positions and emulate the substitution, that means advance one position in the tree (to all nonmatching possibilities) and one position in the pattern. Then, continue your search as normal (you have consumed your possibility of one error, so you're searching for exact match now).

    Whenever you reach the end of the pattern, report successful match (ie. all leaves below the current node in the tree).