algorithm data-structures divide-and-conquer

Maximum number of subsets of overlapping intervals

Given a set of intervals S what is the efficient way to find the number of subsets assigned to each interval from the set S.

say for example

S = (11,75), (14,62), (17,32), (24,48), (31,71), (34,74), (40,97), (41,58)

as for output

(11, 75) => 6 -> (14,62), (17,32), (24,48), (31,71), (34,74), (41,58)  
(14, 62) => 3 -> (17,32), (24,48), (41,58)   
(17, 32) => 0  
(24, 48) => 0  
(31, 71) => 1 -> (41,58)   
(34, 74) => 1 -> (41,58)  
(40, 97) => 1 -> (41,58)  
(41, 58) => 0

Is it possible to get this mapping in o(nlogn) or substantially less than o(n²)?

Solution

There seems to be an O(n*log(n)) way to do this. The intuition is that we need some sort of way to organize the intervals in a way where, at the current step, all intervals that the current one could contain have been considered.

The algorithm is as follows:

Sort the intervals by end times in ascending order, and sort tied end times by their start times in descending order.
Iterate over the intervals and maintain a sorted set of all start times seen. The idea is that, when looking at the current interval, all intervals that it could contain have already been examined, and the number of intervals that the current one does contain is simply the number of elements in our built set that have a start time later than the current one.

Walking through the example, we first find that

sortedIntervals = [(17,32), (24,48), (41,58), (14,62), (31,71), (34,74), (11,75),(40,97)]

And let our sorted set of intervals (sorting now by start time) be

examinedIntervals = []

Now let's step through sortedIntervals

Consider (17,32). examinedIntervals is empty, so it doesn't contain anything.
Consider (24, 48). examinedIntervals = [(17, 32)] . Because there are no intervals that start after 24, we have that (24, 48) contains 0 intervals.
Consider (41, 58). examinedIntervals = [(17, 32), (24, 48)]. No intervals have a start time after 41, so (41, 58) contains 0 intervals
Consider (14, 62). examinedIntervals = [(17, 32), (24, 48), (41, 58)]. Now all three intervals have a start time after 14, so (14, 62) contains 3 intervals
Consider (31, 71). examinedIntervals = [(14, 62), (17, 32), (24, 48), (41, 58)]. Only one interval comes after 31, so (31, 71) contains 1 interval
Consider (34, 74). examinedIntervals = [(14, 62), (17, 32), (24, 48), (31, 71), (41, 58)]. One interval comes after 34, so (34, 74) contains 1 interval
Consider (11, 75). examinedIntervals = [(14, 62), (17, 32), (24, 48), (31, 71), (34, 74), (41, 58)], and all 6 intervals have a start time after 14.
Consider (40, 97). examinedIntervals = [(11, 75), (14, 62), (17, 32), (24, 48), (31, 71), (34, 74), (41, 58)]. Only one interval comes after 40, so (40, 97) contains 1 interval.

Summarizing we do indeed get the correct results:

(40, 97) -> 1
(11, 75) -> 6
(34, 74) -> 1
(31, 71) -> 1
(14, 62) -> 3
(41, 58) -> 0
(24, 48) -> 0
(17, 32) -> 0

It can also be verified easily that runtime is O(n*log(n)), assuming the use of an efficient sort and a balanced tree in the second part. The initial sort runs in the given amount of time. The second portion involves n insertions into a binary tree of height O(log(n)), giving a runtime of O(nlog(n)). Because we're summing two steps that run in O(nlog(n)), the overall runtime is O(nlog(n)).