Given a set of intervals, I would like to find non-overlapping distinct intervals from a set of intervals.
For example:
Input: [[1,10], [5,20], [6,21], [17,25], [22,23], [24,50] ,[30,55], [60,70]]
Output: [[1,5], [21,22], [23,24], [25,30], [50,55], [60,70]]
How can I do that?
What I have currently tried:
gene_bounds_list = [[1,10],[5,20], [6,21],[17,25],[22,23], [24,50],[30,55],[60,70]]
overlap_list = []
nonoverlap_list = []
nonoverlap_list.append(gene_bounds_list[0])
for i in range(1, len(gene_bounds_list)):
curr_gene_bounds = gene_bounds_list[i]
prev_gene_bounds = nonoverlap_list[-1]
if curr_gene_bounds[0]<prev_gene_bounds[0]:
if curr_gene_bounds[1]<prev_gene_bounds[0]: #case1
continue
if curr_gene_bounds[1] < prev_gene_bounds[1]: #case2
nonoverlap_list[-1][0] = curr_gene_bounds[1]
if curr_gene_bounds[1]>prev_gene_bounds[1]:
# previous gene was completely overlapping within current gene,
# so replace previous gene by current (bigger) gene and put previous gene into overlap list
overlap_list.append(nonoverlap_list[-1])
new_bound = [gene_bounds_list[i][0], gene_bounds_list[i][1]]
nonoverlap_list.pop()
nonoverlap_list.append([new_bound[0], new_bound[1]])
elif curr_gene_bounds[0] > prev_gene_bounds[0] and curr_gene_bounds[1] < prev_gene_bounds[1]:
# completely within another gene
overlap_list.append([curr_gene_bounds[0], curr_gene_bounds[1]])
elif curr_gene_bounds[0] < prev_gene_bounds[1]:
# partially overlapping with another gene
new_bound = [nonoverlap_list[-1][1], curr_gene_bounds[1]]
nonoverlap_list[-1][1] = curr_gene_bounds[0]
nonoverlap_list.append([new_bound[0], new_bound[1]])
else:
# not overlapping with another gene
nonoverlap_list.append([gene_bounds_list[i][0], gene_bounds_list[i][1]])
unique_data = [list(x) for x in set(tuple(x) for x in gene_bounds_list)]
within_overlapping_intervals = []
for small in overlap_list:
for master in unique_data:
if (small[0]==master[0] and small[1]==master[1]):
continue
if (small[0]>master[0] and small[1]<master[1]):
if(small not in within_overlapping_intervals):
within_overlapping_intervals.append([small[0], small[1]])
for o in within_overlapping_intervals:
nonoverlap_list.append(o) # append the overlapping intervals
nonoverlap_list.sort(key=lambda tup: tup[0])
flat_data = sorted([x for sublist in nonoverlap_list for x in sublist])
new_gene_intervals = [flat_data[i:i + 2] for i in range(0, len(flat_data), 2)]
print(new_gene_intervals)
However, this gives me an output of: [[1, 5], [6, 10], [17, 20], [21, 22], [23, 24], [25, 30], [50, 55], [60, 70]]
Any ideas of how I can remove the unwanted intervals?
Here is a way to do it. The idea is to keep track of the number of layers of intervals at any point. For this, we add one layer when entering an interval, and remove one when exiting.
We start by building sorted lists of starts and ends. In order to identify the if a value is a start or an end, we create tuples (start, 1)
or (end, -1)
.
Then, we merge these two lists, sorting by value, and iterate over the resulting list (using heapq.merge makes this easy). Each time the number of layers changes to 1, we have the start of a non-overlapping interval. When it changes again, it's the end of it.
from heapq import merge
def non_overlapping(data):
out = []
starts = sorted([(i[0], 1) for i in data]) # start of interval adds a layer of overlap
ends = sorted([(i[1], -1) for i in data]) # end removes one
layers = 0
current = []
for value, event in merge(starts, ends): # sorted by value, then ends (-1) before starts (1)
layers += event
if layers ==1: # start of a new non-overlapping interval
current.append(value)
elif current: # we either got out of an interval, or started an overlap
current.append(value)
out.append(current)
current = []
return out
data = [[1,10], [5,20], [6,21], [17,25], [22,23], [24,50] ,[30,55], [60,70]]
non_overlapping(data)
# [[1, 5], [21, 22], [23, 24], [25, 30], [50, 55], [60, 70]]
Note that the expected answer you put in your question is wrong (for example, it contains a 45 that isn't part of the input data)