I'm working on an NLP project and I have a list with character spans in some text. This list could look like the following:
[(1,4),(1,7),(4,9),(8,15)]
So my task is to return all non-overlapping pairs.
If two or more number pairs are overlapping, then the pair with the longest span should be returned. In my example, I want to return [(1,7),(8,15)]
. How can I do this?
EDIT
I don't want to merge my intervals like in Merge overlap here. But i wan't to return all pairs/intervals/tuples except if the values in some tuples overlaps. e.g. (1,4) and (1,7) overlaps, (4,9) overlaps with (1,4) and (1,7) . If there is some overlap i want to return the tuple with the largest span e.g. (1,7) = span 7, (1,4) = span 4, (4,9) = span 5. That means that it should return (1,7) and (8,15) as well since (8,15) are not overlapping (1,7)
Haven't thought about it too much, but the following approach should do the trick:
spans = [(1,4),(1,7),(4,9),(8,15)]
del_in = []
for x in spans:
if spans.index(x) in del_in: continue
for y in spans:
if spans.index(y) in del_in: continue
if x == y: continue
if len(set(list(range(x[0],x[1]+1))) & set(list(range(y[0],y[1]+1)))) > 0:
if len(list(range(x[0],x[1]+1))) > len(list(range(y[0],y[1]+1))):
del_in.append(spans.index(y))
spans.pop(spans.index(y))
elif len(list(range(y[0],y[1]+1))) > len(list(range(x[0],x[1]+1))):
del_in.append(spans.index(x))
spans.pop(spans.index(x))
print(spans)
Would output:
[(1, 7), (8, 15)]
Short explanation:
We start to iterate through the list of spans (iteration x
). For every tuple, we iterate again through the list (iteration y
), neglecting the tuple from x
(if x == y: continue
). We then transform every tuple into a list (e.g. (1,4)
to [1, 2, 3, 4]
) with list(range())
and compare the lists (as possible with set()
) to see if they have any element in common. If the length of the returned object is bigger than 0, we compare the length of the two lists to see which one we keep (the one with the bigger span). Subsequently, we delete the item with the smaller span from our list (with .pop()
). Since our iterations x
and y
still go through our initial list, we need to implement a mechanism to prevent the script from comparing items with already deleted entries. Hence, we work with an additional list containing die indexes of entries we already deleted (and leave them out in the two for loops).
There might be a better way to do this, but this definitely works for your example.