Search code examples
stringalgorithmanalysis

Find identical phrases in strings


I have sets of texts. The texts are rather short (up to 500 characters). The sets consist of up to 10 texts. I try to find phrases as long as possible which occur in most of the texts. In other words I am looking for identical substrings. The longer the better and the more texts they occur in also the better.

Example:

  • "The red brown fox jumps over the lazy dog"
  • "The red haired girl smokes brown cigars"
  • "Where the yellow fox jumps over the haywire"
  • "All the boys like a red brown fox"
  • "A girl like a fox jumps over the dead boys grave"

Phrases (one word phrases ommitted):

  • "red brown fox", length 12, occurence 2
  • "fox jumps over the", length 18, occurrence 3
  • "The red", length 7, occurrence 2

Phrases like "brown fox" or "fox jumps" are omitted, because they are subphrases of longer phrases.

I am looking for an algorithm to find those phrases.


Solution

  • Finding the longuest commun substring is a commun DP algorithm explained pretty well here. https://www.geeksforgeeks.org/longest-common-substring-dp-29/ . After to find the occurence of the strings in the set of text you can simply use a code loke this.

    substring = "red brown fox"
    n = 0
    for text in texts:
        if substring in text:
           n = n + 1
    print(substring, n)