Search code examples
pythonstringlistlongest-substring

Finding *modal* substring of a list of strings in python


Finding a common substring has been answered in many questions, i.e. given a list of strings find the (typically longest or largest number of words) substring that is common to all of them. See here:

Longest common substring from more than two strings - Python

My question is, how does one go about finding the longest modal substring in a list of strings? The important stipulation here is that this substring does not necessarily have to appear in all strings in the list.

There is a little bit of art to the science here because the obvious tradeoff is between 1) how many strings do you want the substring to appear in? and 2) how long do you want the substring to be? To fix ideas, lets just assume that we want the desired substring to have three words (in case of a tie here, take the longest string, followed by first instance).

So given the list,

mylist = ["hey how's it going?", "I'm fine  thanks.", "Did you get that thing?", "Of course I got that thing, that's why I asked you: how's it going?"]

The desired output is,

"how's it going"

If the stipulation was instead two words long then the desired output would be,

"that thing"

Since "that thing" is a longer string than, "how's it" or "it going"

The above answers in code are the modal substrings of three and two words long, respectively.


EDIT:

Since there is a bounty on this I will be a little more specific with what a modal sub-string is.

Modal substring: For a given length of words in the substring (this is needed to uniquely identify the modal substring), the modal substring is the substring that is common to the largest number of strings in the list. If there is a tie (i.e. for a given length of three words in the substring there are two candidate substrings that both appear in 80% of the strings) then the substring with the longest character length should be used. If there is a still a tie after that (which should be very unlikely but is good to account for), then just take the first one or choose randomly.


A good answer would have a function that returns the modal substring for a given number of words in the substring (where the number of words can be an arbitrary number).

An incredible answer would dispense with the 'given number of words' restriction and instead include a scalar (say \alpha) that manages the tradeoff between substring length (in words) and number of times it appears in the list. Alpha close to 1 would choose a modal substring that is very long (in words) but doesn't necessarily appear many times in the list. Alpha close to 0 would choose a modal substring that appears in as many times in the list as possible and doesn't care about substring length. I am not really expecting this though and will accept an answer that answers the original question.


Solution

  • Finding modal substring of given length

    Our input is as follows:

    mylist = ["hey how's it going?", "I'm fine  thanks.", "Did you get that thing?", "Of course I got that thing, that's why I asked you: how's it going?"]
    

    First, we'll define a preprocessing function, to clean up extra white spaces and punctuation. This makes it easier to extract words from the sentences:

    def preprocess(strings):
        # used this answer to get rid of extra whitespaces: https://stackoverflow.com/a/1546251/6735980
        return [" ".join(string.split()).replace(",", "").replace(".", "").replace("?", "") for string in strings]
    

    We'll also define a helper function to find n-grams (substrings of n words in a sentence):

    def find_n_grams(string, n):
        words = string.split(" ")
        n_grams = []
    
        for i in range(len(words) - n + 1):
            n_grams.append(" ".join([words[i + idx] for idx in range(n)]))
    
        return n_grams
    

    Then, we can use the following function to find a "modal substring" as defined in the question, for a given number of words. I am not sure if this is the most optimal/efficient way to compute it, but it does the job:

    def find_modal_substring(strings, num_words):
        n_grams_per_string = [find_n_grams(string, num_words) for string in strings]
        max_num_occurences = 0
        modal_substring = None
    
        for i in range(len(strings)):
            n_grams = n_grams_per_string[i]
    
            for n_gram in n_grams:
                num_occurences = 1
    
                for j in range(i + 1, len(strings)):
                    if n_gram in n_grams_per_string[j]:
                        num_occurences += 1
    
                if num_occurences > max_num_occurences:
                    max_num_occurences = num_occurences
                    modal_substring = n_gram
                elif num_occurences == max_num_occurences and len(modal_substring) < len(n_gram):
                    max_num_occurences = num_occurences
                    modal_substring = n_gram
    
        return modal_substring
    

    Some input/output pairs:

    > print(find_modal_substring(preprocess(mylist), 3))
    

    how's it going

    > print(find_modal_substring(preprocess(mylist), 2))
    

    that thing


    Finding modal substring with alpha parameter

    The most difficult thing in this part of the question is how to define a score function given the alpha parameter. We know that:

    Alpha close to 1 would choose a modal substring that is very long (in words) but doesn't necessarily appear many times in the list. Alpha close to 0 would choose a modal substring that appears in as many times in the list as possible and doesn't care about substring length.

    One score function that satisfies this is the following, but we can probably think of many others as well. By isolating it in a function, anyone can easily modify it to their needs:

    def compute_score(n_gram, occurences, alpha):
        return alpha * len(n_gram.split(" ")) + (1.0 - alpha) * occurences
    

    Given that, the function to find an alpha-based modal substring is rather long because it first needs to find all possible n-grams for all n, but here we go:

    def find_modal_substring_alpha(strings, alpha):
        n_grams_per_n_per_string = []
        n = 1
    
        while True:
            n_grams_per_string = [find_n_grams(string, n) for string in strings]
    
            if all(n_grams == [] for n_grams in n_grams_per_string):
                break
            else:
                n_grams_per_n_per_string.append(n_grams_per_string)
    
            n += 1
    
        best_modal_substring = None
        best_score = 0.0
    
        for n_grams_per_string in n_grams_per_n_per_string:
            for i in range(len(strings)):
                n_grams = n_grams_per_string[i]
    
                for n_gram in n_grams:
                    num_occurences = 1
    
                    for j in range(i + 1, len(strings)):
                        if n_gram in n_grams_per_string[j]:
                            num_occurences += 1
    
                    score = compute_score(n_gram, num_occurences, alpha)
    
                    if score > best_score:
                        best_score = score
                        best_modal_substring = n_gram
                    elif score == best_score and len(best_modal_substring) < len(n_gram):
                        best_score = score
                        best_modal_substring = n_gram
    
        return best_modal_substring
    

    It turns out that alpha is a bit difficult to tune here. Even for a relatively low alpha (0.3), we get exactly the same as we would get for 1.0:

    > print(find_modal_substring_alpha(preprocess(mylist), 0.3))
    

    Of course I got that thing that's why I asked you: how's it going

    If we move alpha all the way down to 0.01 for example, we get:

    > print(find_modal_substring_alpha(preprocess(mylist), 0.01))
    

    how's it going