Search code examples
pythonrangeaggregateunionoverlap

How to "union" overlapping range to non-overlapping range?


Question: Can anyone suggest a better or more pythonic approach, to reducing overlapping range pairs to non overlapping range pairs?

Background: I have a list of tuples representing start and end pairs. I am trying to essentially complete a union of all the start ends pairs. The input start end pairs have overlapping values and the output should represent the input start end pairs without any overlap.

The code below is close but wrong as it outputs an extra range that was not in the input (I also realize it is not very good, and why its wrong). Can anyone suggest a better approach, or some built in function I overlooked?

Apologies for the basic question. Thanks for the help!

##create example data
pairA =[(0,5),(10,12)]
pairB =[(1,2),(11,15)]
pairC =[(1,4),(10,12),(15,17)]

#combine the lists to one list
#ultimately may have n number of lists and felt it would be easier to
merged = pairA + pairB +pairC
# produce union of list * unpacks the arguments of a list
listUnion= sorted(set().union(*merged))

#this is the piece of code I am looking at improving
#it creates new start end pairs based on the union
lastElement =listUnion[-1]
outList=[]

for item in listUnion:
    #create start end pair from value i and i+1
    if item != lastElement:
        outList.append((item,listUnion[listUnion.index(item)+1]))
    else:
        #last element of the list, becomes the last element of list pair
        #it can be ignored
        pass
print outList 
"""output: [(0, 1), (1, 2), (2,4), (4, 5), (5, 10), (10, 11), (11, 12), (12, 15), (15, 
17)]
correct output: would not have (5,10) as there is no overlap here in the input """

Edit: Added this visual representation of the problementer image description here


Solution

  • Here is a solution. It's probably not very pythonic, because my experience with Python is very limited, but it works.

    pairs_a = [(0, 5), (10, 12)]
    pairs_b = [(1, 2), (11, 15)]
    pairs_c = [(1, 4), (10, 12), (15, 17)]
    
    merged = pairs_a + pairs_b + pairs_c
    merged.sort()
    
    set_list = []
    cur_set = set()
    cur_max = merged[0][1]
    for pair in merged:
        p0, p1 = pair
        if cur_max < p0:
            set_list.append(cur_set)
            cur_set = set()
        cur_set.add(p0)
        cur_set.add(p1)
        if cur_max < p1:
            cur_max = p1
    set_list.append(cur_set)
    
    out_list = []
    for my_set in set_list:
        my_list = sorted(my_set)
        p0 = my_list[0]
        for p1 in my_list[1:]:
            out_list.append((p0, p1))
            p0 = p1
    
    # more pythonic but less readable in spite of indentation efforts:
    # out_list = [pair
    #             for zipped in [zip(list[:-1], list[1:])
    #                            for list in [sorted(set)
    #                                         for set in set_list]]
    #                 for pair in zipped]
    
    # alternate ending:
    # out_list = [sorted(set) for set in set_list]
    
    print(out_list)
    

    The idea is to sort all range pairs by the first item first. This is what merged.sort() does (it uses successive tuple members to disambiguate, but this is unimportant here). Then we loop over the sorted range pairs, and as long as we are within a bunch of overlapping ranges, we add all starts and ends to the current set. In order to know when the bunch ends, we keep the max of all range ends. As soon as a range start arrives that is beyond this max, we store away the current set by appending it to a list, and begin a new one. The last set has to be added to the list after the loop. Now we have a list of sets, which we can easily translate to a list of lists or to a list of pairs.