Search code examples
stringalgorithmanagram

Anagram of String 2 is Substring of String 1


How to find that any anagram of String 1 is sub string of String 2?

Eg :-

String 1 =rove

String 2=stackoverflow

So it will return true as anagram of "rove" is "over" which is sub-string of String 2


Solution

  • On edit: my first answer was quadratic in the worst case. I've tweaked it to be strictly linear:

    Here is an approach based on the notion of a sliding window: Create a dictionary keyed by the letters of the first dictionary with frequency counts of the letters for the corresponding values. Think of this as a dictionary of targets which need to be matched by m consecutive letters in the second string, where m is the length of the first string.

    Start by processing the first m letters in the second string. For each such letter if it appears as a key in the target dictionary decrease the corresponding value by 1. The goal is to drive all target values to 0. Define discrepancy to be the sum of the absolute values of the values after processing the first window of m letters.

    Repeatedly do the following: check if discrepancy == 0 and return Trueif it does. Otherwise -- take the character m letters ago and check if it is a target key and if so -- increase the value by 1. In this case, this either increases or decreases the discrepancy by 1, adjust accordingly. Then get the next character of the second string and process it as well. Check if it is a key in the dictionary and if so adjust the value and the discrepancy as appropriate.

    Since there are no nested loop and each pass through the main loop involves just a few dictionary lookups, comparisons, addition and subtractions, the overall algorithm is linear.

    A Python 3 implementation (which shows the basic logic of how the window slides and the target counts and discrepancy are adjusted):

    def subAnagram(s1,s2):
        m = len(s1)
        n = len(s2)
        if m > n: return false
        target = dict.fromkeys(s1,0)
        for c in s1: target[c] += 1
    
        #process initial window
        for i in range(m):
            c = s2[i]
            if c in target:
                target[c] -= 1
        discrepancy = sum(abs(target[c]) for c in target)
    
        #repeatedly check then slide:
        for i in range(m,n):
            if discrepancy == 0:
                return True
            else:
                #first process letter from m steps ago from s2
                c = s2[i-m]
                if c in target:
                    target[c] += 1
                    if target[c] > 0: #just made things worse
                        discrepancy +=1
                    else:
                        discrepancy -=1
                #now process new letter:
                c = s2[i]
                if c in target:
                    target[c] -= 1
                    if target[c] < 0: #just made things worse
                        discrepancy += 1
                    else:
                        discrepancy -=1
        #if you get to this stage:
        return discrepancy == 0
    

    Typical output:

    >>> subAnagram("rove", "stack overflow")
    True
    >>> subAnagram("rowe", "stack overflow")
    False
    

    To stress-test it, I downloaded the complete text of Moby Dick from Project Gutenberg. This has over 1 million characters. "Formosa" is mentioned in the book, hence an anagram of "moors" appears as a substring of Moby Dick. But, not surprisingly, no anagram of "stackoverflow" appears in Moby Dick:

    >>> f = open("moby dick.txt")
    >>> md = f.read()
    >>> f.close()
    >>> len(md)
    1235186
    >>> subAnagram("moors",md)
    True
    >>> subAnagram("stackoverflow",md)
    False
    

    The last call takes roughly 1 second to process the complete text of Moby Dick and verify that no anagram of "stackoverflow" appears in it.