There are some keywords in text and start/end position of their occurrences. Keywords may partially overlap e.g. "something" -> "something"/"some"/"thing":
keywords_occurences = {
"key_1": [(11, 59)],
"key_2": [(24, 46), (301, 323), (1208, 1230), (1673, 1695)],
"key_3": [(24, 56), (1208, 1240)],
...
}
I need to select one position for each keyword so they don't overlap, for this case solution is:
key_1: 11-59
key_2: 301-323 (or 1673-1695, it does not matter)
key_3: 1208-1240
If this could not be done, select maximum numbers of unique non-overlapping keys.
Looks like a kind of "exact hitting set" problem, but I can't find algorithm description.
I think the following code does what you want.
#!/usr/bin/env python
# keyword occurrences -> [('key_1', (11, 59)), ('key_2', (301, 333)), ('key_3', ())]
kw_all_occ = {"key_1" : [(11, 59)],
"key_2" : [(24, 56), (301, 333), (1208, 1240), (1673, 1705)],
"key_3" : [(24, 46), (1208, 1230)]}
def non_overlapping_occ(occ):
# dictionary with all keyword occurrences
all_occ = dict({})
all_occ.update(occ)
# list with the first non overlapping occurrences of every keyword ->
# [('key_1', (start_1, end_1)),('key_2', (start_2, end_2)),...]
result = []
# Sort keywords by length -> [(22, 'key_3'), (32, 'key_2'), (48, 'key_1')]
kw_lengths = []
for k, v in all_occ.iteritems():
kw_lengths.append((v[0][1] - v[0][0], k))
kw_lengths.sort()
while len(kw_lengths):
# Current longest keyword
longest_keyword = kw_lengths.pop(-1)[1]
try:
result.append((longest_keyword, all_occ[longest_keyword][0]))
# Remove occurrences overlapping any occurrence of the current
# longest_keyword value
for item in all_occ[longest_keyword]:
start = item[0]
end = item[1]
for l, k in kw_lengths:
v = all_occ[k]
all_occ[k] = filter(lambda x: (x[0] > end) | (x[1] < start), v)
except IndexError:
result.append((longest_keyword, ()))
return result
print non_overlapping_occ(kw_all_occ)
It produces the following output:
vicent@deckard:~$ python prova.py
[('key_1', (11, 59)), ('key_2', (301, 333)), ('key_3', ())]
Note that I'm not using sets in the code. Your question just suggest that sets could help to solve the problem so I understand that using sets isn't mandatory for you.
Also note that the code has not been deeply tested but it seems to work just fine (it also deals properly with the keywords occurrences provided in your question. In fact, those occurrences can be solved with some simpler but less general code).