Search code examples
stringalgorithmtheorydfasubsequence

Deterministic automata to find number of subsequence in string of another string


Deterministic automata to find number of subsequences in string ? How can I construct a DFA to find number of occurence string as a subsequence in another string?

eg. In "ssstttrrriiinnngggg" we have 3 subsequences which form string "string" ?

also both string to be found and to be searched only contain characters from specific character Set . I have some idea about storing characters in stack poping them accordingly till we match , if dont match push again . Please tell DFA solution ?


Solution

  • OVERLAPPING MATCHES

    If you wish to count the number of overlapping sequences then you simply construct a DFA that matches the string, e.g.

    1 -(if see s)-> 2 -(if see t)-> 3 -(if see r)-> 4 -(if see i)-> 5 -(if see n)-> 6 -(if see g)-> 7

    and then compute the number of ways of being in each state after seeing each character using dynamic programming. See the answers to this question for more details.

    DP[a][b] = number of ways of being in state b after seeing the first a characters
             = DP[a-1][b] + DP[a-1][b-1] if character at position a is the one needed to take state b-1 to b
             = DP[a-1][b] otherwise
    

    Start with DP[0][b]=0 for b>1 and DP[0][1]=1.

    Then the total number of overlapping strings is DP[len(string)][7]

    NON-OVERLAPPING MATCHES

    If you are counting the number of non-overlapping sequences, then if we assume that the characters in the pattern to be matched are distinct, we can use a slight modification:

    DP[a][b] = number of strings being in state b after seeing the first a characters
             = DP[a-1][b] + 1 if character at position a is the one needed to take state b-1 to b and  DP[a-1][b-1]>0
             = DP[a-1][b] - 1 if character at position a is the one needed to take state b to b+1 and DP[a-1][b]>0
             = DP[a-1][b] otherwise
    

    Start with DP[0][b]=0 for b>1 and DP[0][1]=infinity.

    Then the total number of non-overlapping strings is DP[len(string)][7]

    This approach will not necessarily give the correct answer if the pattern to be matched contains repeated characters (e.g. 'strings').