What sort of algorithm could be used to measure the similarity (overlap/precision/recall/...) of two sets of ranges?
I can think of (or find online) hundreds of similar problems, but never this exact one; surely this "wheel" has been invented already...
Let's say the input data is something like:
Real      [ ## ###  ## ]  or  [(1,2),(4,6),(9,10)]
Predicted [ ## #        ]  or  [(1,2),(4,4)]
Output should be ~50 %
Should I, for example, AND bitmaps together, use interval trees, or something else? Is there a nice functional or simple-to-write algorithm? Any meaningful measure of similarity will do, as will any reasonable input format.
Thank you.
(Realistic total range length is ~4000, with < 50 intervals in each set.)
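At these sizes (~4000 positions, fewer than 50 intervals per set) the bitmap-AND approach is already cheap and simple to write. A minimal sketch, assuming inclusive integer intervals and using Python sets in place of bitmaps; precision, recall and a Jaccard-style overlap all fall out of the same intersection:

```python
real = [(1, 2), (4, 6), (9, 10)]
predicted = [(1, 2), (4, 4)]

# "Bitmap" of every covered integer position in each set.
real_bits = {p for a, b in real for p in range(a, b + 1)}
pred_bits = {p for a, b in predicted for p in range(a, b + 1)}

overlap = len(real_bits & pred_bits)             # 3
precision = overlap / len(pred_bits)             # 3/3 = 1.0
recall = overlap / len(real_bits)                # 3/7 ≈ 0.43
jaccard = overlap / len(real_bits | pred_bits)   # 3/7 ≈ 0.43
```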
You could break the segments into individual points, tagging each point as real/predicted and as start/end.
Then sort the points, go over the sorted list, and keep track of overlaps.
You don't even need to track whether an interval was originally from Real or Predicted (assuming the intervals within each set don't overlap each other) - you just need to keep track of whether one or two intervals are open at each point.
Example:
Real [(1,2),(4,6),(9,10)]
Predicted [(1,2),(4,4)]
Broken down into points and sorted (S for Start, E for End):
[(1,S),(1,S),(2,E),(2,E),(4,S),(4,S),(4,E),(6,E),(9,S),(10,E)]
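A small sketch of that breakdown, assuming tuples as in the question (the sort key only makes Starts come before Ends at equal positions):

```python
real = [(1, 2), (4, 6), (9, 10)]
predicted = [(1, 2), (4, 4)]

# Turn every interval endpoint into a (position, Start/End) event.
events = [(a, 'S') for a, b in real + predicted] + \
         [(b, 'E') for a, b in real + predicted]
# Sort by position; at equal positions, put Starts before Ends.
events.sort(key=lambda e: (e[0], 0 if e[1] == 'S' else 1))
print(events)
# [(1, 'S'), (1, 'S'), (2, 'E'), (2, 'E'), (4, 'S'), (4, 'S'),
#  (4, 'E'), (6, 'E'), (9, 'S'), (10, 'E')]
```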
Then, going through the array, keep track of how many segments are "open" at each point, and count both the total length where at least one segment is open ("total open") and the length where two segments are open ("2 segments open").
The result is (2 segments open) / (total open), i.e. the intersection length divided by the union length of the two sets of intervals.
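Putting it together, a sketch of the sweep under the same assumptions as above (inclusive integer intervals, no overlaps within either set). It converts the inclusive endpoints to half-open [a, b+1) so the length arithmetic stays simple, and returns the intersection-over-union described above:

```python
def interval_similarity(real, predicted):
    # Sweep-line over interval endpoints.
    # Assumes inclusive integer intervals and no overlaps within either set.
    events = []
    for a, b in list(real) + list(predicted):
        events.append((a, +1))      # an interval opens at a
        events.append((b + 1, -1))  # ...and closes just after b (half-open [a, b+1))
    events.sort()

    open_count = 0    # how many intervals cover the current position
    union = both = 0  # length covered by at least one set / by both sets
    prev = 0
    for pos, delta in events:
        if open_count >= 1:
            union += pos - prev   # "total open"
        if open_count == 2:
            both += pos - prev    # "2 segments open"
        open_count += delta
        prev = pos
    return both / union if union else 1.0

print(interval_similarity([(1, 2), (4, 6), (9, 10)], [(1, 2), (4, 4)]))  # 3/7 ≈ 0.43
```

With fewer than 50 intervals per set the sort dominates and this is effectively instantaneous, so no interval tree is needed.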