I have a set of many simple globbing patterns and whole words, like this:
s = set(['ALE', 'BREAD*', 'BREAKFAST*', 'BROTH' ...])
I also a large list of words. I want to check if each word in this list matches either a) a globbing pattern in the set or b) a word in the set.
If there were no globbing patterns, I would just do something like:
for word in words:
if word in s:
# do something
But since the set contains globbing patterns as well, it won't find a match if I want to match 'BREADY' to 'BREAD*'
The only way I can think of doing this would be to use a nested for loop to compare each word to each pattern in the set. Is there a way I can check if each word has a match in the set without comparing it with every element in the set?
You should store the full strings you want to match against separately from the prefixes you want to match. For your prefixes, further partition them into sets of equal-length prefixes (i.e. one set of length-1 prefixes, one set of length-2 prefixes, etc.).
i.e.
fullstrings = set(["BREAKFAST", "LUNCH", "DINNER", ...])
prefixes_by_length = {} # dict of length -> prefix string
...
prefixes_by_length[4] = set(["CORN", "DESK", ...])
prefixes_by_length[5] = set(["BREAD", "TABLE", ...])
Full string match is simple - just check if word in fullstrings
.
For prefixes, you would check each length separately, starting from length 1 to the max prefix length you want to match. For each length n
, check if word[:n] in prefixes_by_length[n]
.
This would be a lot of efficient than looping over all your prefixes every time, if you have a lot of them.
for word in words:
if word in fullstrings:
"Match! do something"
for n in prefixes_by_length:
if word[:n] in prefixes_by_length[n]:
"Match! do something"