Search code examples
pythonstringalgorithmmatchingfuzzy-search

find best subset from list of strings to match a given string


I have a string

s = "mouse"

and a list of string

sub_strings = ["m", "o", "se", "e"]

I need to find out what is the best and shortest matching subset of sub_strings the list that matches s. What is the best way to do this? The ideal result would be ["m", "o", "se"] since together they spell mose


Solution

  • You can use a regular expression:

    import re
    
    def matches(s, sub_strings):
        sub_strings = sorted(sub_strings, key=len, reverse=True)
        pattern = '|'.join(re.escape(substr) for substr in sub_strings)
        return re.findall(pattern, s)
    

    This is at least short and quick, but it will not necessarily find the best set of matches; it is too greedy. For example,

    matches("bears", ["bea", "be", "ars"])
    

    returns ["bea"], when it should return ["be", "ars"].


    Explanation of the code:

    • The first line sorts the substrings by length, so that the longest strings appear at the start of the list. This makes sure that the regular expression will prefer longer matches over shorter ones.

    • The second line creates a regular expression pattern consisting of all the substrings, separated by the | symbol, which means “or”.

    • The third line just uses the re.findall function to find all matches of the pattern in the given string s.