Search code examples
pythonglob

Efficiently checking if a word matches patterns in set (Python)


I have a set of many simple globbing patterns and whole words, like this:

s = set(['ALE', 'BREAD*', 'BREAKFAST*', 'BROTH' ...])

I also a large list of words. I want to check if each word in this list matches either a) a globbing pattern in the set or b) a word in the set.

If there were no globbing patterns, I would just do something like:

for word in words:
    if word in s:
        # do something

But since the set contains globbing patterns as well, it won't find a match if I want to match 'BREADY' to 'BREAD*'

The only way I can think of doing this would be to use a nested for loop to compare each word to each pattern in the set. Is there a way I can check if each word has a match in the set without comparing it with every element in the set?


Solution

  • You should store the full strings you want to match against separately from the prefixes you want to match. For your prefixes, further partition them into sets of equal-length prefixes (i.e. one set of length-1 prefixes, one set of length-2 prefixes, etc.).

    i.e.

    fullstrings = set(["BREAKFAST", "LUNCH", "DINNER", ...])
    prefixes_by_length = {} # dict of length -> prefix string
    ...
    prefixes_by_length[4] = set(["CORN", "DESK", ...])
    prefixes_by_length[5] = set(["BREAD", "TABLE", ...])
    

    Full string match is simple - just check if word in fullstrings.

    For prefixes, you would check each length separately, starting from length 1 to the max prefix length you want to match. For each length n, check if word[:n] in prefixes_by_length[n].

    This would be a lot of efficient than looping over all your prefixes every time, if you have a lot of them.

    for word in words:
        if word in fullstrings:
            "Match! do something"
        for n in prefixes_by_length:
            if word[:n] in prefixes_by_length[n]:
                "Match! do something"