Search code examples
algorithmsubsequencelcs

DNA subsequence dynamic programming question


I'm trying to solve DNA problem which is more of improved(?) version of LCS problem. In the problem, there is string which is string and semi-substring which allows part of string to have one or no letter skipped. For example, for string "desktop", it has semi-substring {"destop", "dek", "stop", "skop","desk","top"}, all of which has one or no letter skipped.

Now, I am given two DNA strings consisting of {a,t,g,c}. I"m trying to find longest semi-substring, LSS. and if there is more than one LSS, print out the one in the fastest order.

For example, two dnas {attgcgtagcaatg, tctcaggtcgatagtgac} prints out "tctagcaatg"

and aaaattttcccc, cccgggggaatatca prints out "aattc"

I'm trying to use common LCS algorithm but cannot solve it with tables although I did solve the one with no letter skipped. Any advice?


Solution

  • This is a variation on the dynamic programming solution for LCS, written in Python.

    First I'm building up a Suffix Tree for all the substrings that can be made from each string with the skip rule. Then I'm intersecting the suffix trees. Then I'm looking for the longest string that can be made from that intersection tree.

    Please note that this is technically O(n^2). Its worst case is when both strings are the same character, repeated over and over again. Because you wind up with a lot of what logically is something like, "an 'l' at position 42 in the one string could have matched against position l at position 54 in the other". But in practice it will be O(n).

    def find_subtree (text, max_skip=1):
        tree = {}
        tree_at_position = {}
    
        def subtree_from_position (position):
            if position not in tree_at_position:
                this_tree = {}
                if position < len(text):
                    char = text[position]
                    # Make sure that we've populated the further tree.
                    subtree_from_position(position + 1)
    
                    # If this char appeared later, include those possible matches.
                    if char in tree:
                        for char2, subtree in tree[char].iteritems():
                            this_tree[char2] = subtree
    
                    # And now update the new choices.
                    for skip in range(max_skip + 1, 0, -1):
                        if position + skip < len(text):
                            this_tree[text[position + skip]] = subtree_from_position(position + skip)
    
                    tree[char] = this_tree
    
                tree_at_position[position] = this_tree
    
            return tree_at_position[position]
    
        subtree_from_position(0)
    
        return tree
    
    
    def find_longest_common_semistring (text1, text2):
        tree1 = find_subtree(text1)
        tree2 = find_subtree(text2)
    
        answered = {}
        def find_intersection (subtree1, subtree2):
            unique = (id(subtree1), id(subtree2))
            if unique not in answered:
                answer = {}
                for k, v in subtree1.iteritems():
                    if k in subtree2:
                        answer[k] = find_intersection(v, subtree2[k])
                answered[unique] = answer
            return answered[unique]
    
    
        found_longest = {}
        def find_longest (tree):
            if id(tree) not in found_longest:
                best_candidate = ''
                for char, subtree in tree.iteritems():
                    candidate = char + find_longest(subtree)
                    if len(best_candidate) < len(candidate):
                        best_candidate = candidate
                found_longest[id(tree)] = best_candidate
            return found_longest[id(tree)]
    
        intersection_tree = find_intersection(tree1, tree2)
        return find_longest(intersection_tree)
    
    
    print(find_longest_common_semistring("attgcgtagcaatg", "tctcaggtcgatagtgac"))