Search code examples
pythontuplesoverlapping-matches

Collapse a list of range tuples into the overlapping ranges


I'm looking for the most memory efficient way to solve this problem.

I have a list of tuples representing partial string matches in a sentence:

[(0, 2), (1, 2), (0, 4), (2,6), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]

The first value of each tuple is the start position for the match, the second value is the length.

The idea is to collapse the list so that only the longest continue string match is reported. In this case it would be:

[(0,4), (2,6), (22,6)]

I do not want just the longest range, like in algorithm to find longest non-overlapping sequences, but I want all the ranges collapsed by the longest.

In case your wondering, I am using a pure python implementation of the Aho-Corasick for matching terms in a static dictionary to the given text snippet.

EDIT: Due to the nature of these tuple lists, overlapping but not self-contained ranges should be printed out individually. For example, having the words betaz and zeta in the dictionary, the matches for betazeta are [(0,5),(4,8)]. Since these ranges overlap, but none is contained in the other, the answer should be [(0,5),(4,8)]. I have also modified the input dataset above so that this case is covered.

Thanks!


Solution

  • import operator
    lst = [(0, 2), (1, 2), (0, 4), (2,6), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
    lst.sort(key=operator.itemgetter(1))
    for i in reversed(xrange(len(lst)-1)):
        start, length = lst[i]
        for j in xrange(i+1, len(lst)):
            lstart, llength = lst[j]
            if start >= lstart and start + length <= lstart + llength:
                del lst[i]
                break
    print lst
    #[(0, 4), (2, 6), (22, 6)]