Search code examples
pythonlisttuplesintervalsoverlap

Find longest interval between overlapping intervals in Python?


I have a list of intervals in the form of tuples such as this:

data = [(5,10), (3,10), (13,15), (12,18), (20,29), (25,30)]

Each item in the tuple has 2 values (i.e., start, end) and there may or may not be overlaps between different intervals. If there are overlapping intervals, I want to only take the longest ones. The output in this test would be the following:

output = [(3,10), (12,18), (20,29)]

How can I do this in Python using either the standard library, numpy, or pandas?

I started doing something naive like this but I don't think this will scale well...I also would rather not use NetworkX

import networkx as nx
data = [(5,10), (3,10), (13,15), (12,18), (20,29), (25,30)]

graph = nx.Graph()

n = len(data)
for i, a in enumerate(data):
    a_seq = set(range(a[0], a[1] + 1))
   
    for j in range(i+1, n):
        b = data[j]
        b_seq = set(range(b[0], b[1] + 1))

        n_overlap = len(a_seq & b_seq)
        if n_overlap:
            graph.add_edge(a, b, weight=n_overlap)

output = list()
for nodes in nx.connected_components(graph):
    lengths = dict()
    for node in nodes:
        start, end = node
        lengths[node] = end - start
    longest_interval, length_of_interval = sorted(lengths.items(), key=lambda x:x[1], reverse=True)[0]
    output.append(longest_interval)

I'm assuming there is a better approach but right now it's escaping me.

Edit: There might be some confusion in the task but I can't mix and match intervals (e.g., (20,30) is invalid because it's not one of the starting intervals).


Solution

  • Your post mentioned not using any libraries so that's what I did. Also, you mentioned many clarifications in the comments so this code handles all the edge cases.

    def find_max_length_in_overlap_groups(intervals):
        #get sorted list of intervals by starting point
        aux = sorted(intervals, key=lambda x: x[0])
        new_group = [aux[0]]
        output = []
        for interval in aux[1:]:
            if interval[0] > new_group[-1][-1]:
            #if the interval does not overlap with the last interval we start a new group
                output.append(
                    #find interval of maximum length
                    max(new_group, key=lambda interval: interval[1] - interval[0])
                )
                #start new group
                new_group = [interval]
            else:
                #otherwise just append it onto the same group if it overlaps
                new_group.append(interval)
        #take care of the last group, doing the same as above
        output.append(max(new_group, key=lambda interval: interval[1] - interval[0]))
        return output
    

    This approach is based on a solution for a similar problem found here, with a more complete explanantion. This approach doesn't preserve the original ordering of the output. If you want that you have to modify the function slightly, by storing the original indices and then sorting by them:

    def find_max_length_in_overlap_groups(intervals):
        #store the indices
        indices={val:ind for ind, val in enumerate(intervals)}
        aux = sorted(intervals, key=lambda x: x[0])
        new_group = [aux[0]]
        output = []
        for interval in aux[1:]:
            if interval[0] > new_group[-1][-1]:
                output.append(
                    max(new_group, key=lambda interval: interval[1] - interval[0])
                )
                new_group = [interval]
            else:
                new_group.append(interval)
        output.append(max(new_group, key=lambda interval: interval[1] - interval[0]))
        #sort the output according to indices in input to preserve order
        return sorted(output, key = lambda x: indices[x])
    

    Both algorithms have the same time complexity: O(n log n) due to the sorting