Search code examples
pythonstringpython-3.xlistprefix

Finding longest prefix present on at least 2 elements on a list of "random" strings


Given a list of strings, for example:

myList = ["foo", "foobar", "football", "footbag", "bar"]

find the longest prefix present on at least 2 strings on the list:

Longest prefix is "footba" present in "football" and "footbag"

The list will be filled through inputs, and not all of them will have a common prefix.

To be considered an option, it's enough that the prefix is present on two strings in the list. If there are multiple options, it has to return the longest one.

In my research I've been able to find how to get the longest common prefix on all strings, for example:

List: ["foo_a","foo_b","foo_c","fnord"]

Output: Longest common prefix is "f"

However, strings on my list might not even start with the same letter.


Solution

  • This is a messy implementation, and won't be efficient for large lists, but it gets the job done. I'd suggest looking into the prefix tries that were mentioned though if you have a bit of time, they'll work a lot better.

    This works from the full size word backwards, until two words share the same prefix. It truncates the end of the words and counts how many times the same word appears, and if it's at least 2 it returns it.

    from collections import defaultdict
    
    def list_to_text(x):
        x = list(map(str, x))
        first = '", "'.join(x[:-1]) #Join together (a, b, c)
        if first:
            return '" and "'.join((first, x[-1])) #Add the last element (a, b, c and d)
        return x[0] #Return a single value if list length is 1
    
    def find_longest_prefix(x):
        x_max = len(max(x, key=len))
        for i in range(1, x_max)[::-1]:
    
            #Chop off the end of every word
            trim = [j[:i] for j in x]
    
            #Iterate through every unique value
            result = defaultdict(list)
            for j in set(trim):
                result[trim.count(j)].append(j)
    
            #Finish iterating if there are more than 2 words that share a prefix
            highest_count = max(result)
            if highest_count >= 2:
                prefix = result[highest_count]
                words = [j for k in prefix for j in x if j.startswith(k)]
                return prefix, words
    
    myList = ["foo", "foobar", "football", "footbag", "bar"]
    prefix, words = find_longest_prefix(myList)
    
    #Put together the string
    print('The longest common prefix{} "{}" present in "{}".'.format(' is' if len(prefix) ==1 else 'es are',
                                                                     list_to_text(prefix), list_to_text(words)))
    

    It will format the string a little based on the number of results. Your list will still result in this:

    The longest common prefix is "footba" present in "football" and "footbag".
    

    But adding another prefix with the same length and number of results will result in something like this:

    The longest common prefixes are "footba" and "testin" present in "football", "footbag", "testing" and "testin".