Search code examples
stringalgorithmlanguage-agnosticsubstringcycle

Find the dominant cyclic substring of a given string


I am looking for an algorithm to find the dominant cyclic substring of a given string.

A cyclic substring:

  • a substring which is repeated two or more times adjacently.

A dominant cyclic substring:

  • The substring which has the most adjacent repetitions is dominant
  • (adjacent repetitions occur an equal number of times)
    • the longest length substring is dominant
  • (on ties of length and adjacent repetitions)
    • the substring that appears first is dominant

Example 1:

  • prefixgarbagecyclecyclecyclecyclesufixgarbage
  • returns cycle:=> cycle is the most repeated adjacent substring

Example 2:

  • prefixgarbagecyclepadinggarbagecyclesufixgarbage
  • returns g:=> occurrences of cycle are not repeated adjacently, g repeats twice adjacently

Example 3:

  • prefixgarbagecyclecyclepadinggarbageprefixgarbage
  • returns cycle:=> cycle & g repeat twice adjacently but cycle is longer then g

Example 4:

  • prefixgarbagecyclecyclecycleroundroundroundprefixgarbage
  • returns cycle:=> cycle & round repeat thrice adjacently & same length but cycle appeared first

Exampe 5:

  • abcdefghijklmnopqrstuvqxyz
  • returns <empty string> because there is no repeated adjacent substring

What is the best approach for implementing this algorithm?


Solution

  • Cannot find anything better than this quadratic-time algorithm (implemented in Python):

    IREP, IPER, IPOS = 0, 1, 2
    
    def find_dominant(src):
      best = (0, 0, 0) # repetitions-1, period, position
    
      period = 0
      while period < len(src) // max(2, 1 + best[IREP]):
        period += 1
        length = 0
    
        for pos in range(len(src) - 1 - period, -1, -1):
          if src[pos] == src[pos + period]:
            length += 1
            repetitions = length // period
            if repetitions >= best[IREP]:
              best = (repetitions, period, pos)
          else:
            length = 0
    
      return best
    
    
    s = "prefixgarbagecyclecyclecyclecyclesufixgarbage"
    res = find_dominant(s)
    if res[0] == 0:
      print("nothing found")
    else:
      print(res[IREP] + 1, '*', s[res[IPOS]: res[IPOS] + res[IPER]])
    

    For each possible period scan the string and remember the longest periodic subsequence. Scan it backwards to check less conditions. Stop increasing period when no further improvement could be found.

    Time complexity is O(N2 / R), where R is the number of repetitions of dominant substring. Space complexity is O(1).