I have for example the following python lists:
a = [1,2,1,3,2,1,1,1,2,1,1,1,1,1,3,1,2]
b = [1,1,2,1,3,1,1,1,1,2,2,1,1,1,1,3,1,2]
and I'd like to obtain the tuples of indices of the elements that can be confidently matched, such as:
[(0,0), (1,2), (2,3), (3,4), (8,9), (14,15), (15,16), (16,17)]
The data represent the sizes of groups of people recorded arriving at, and leaving a queue, but the data isn't perfect either, so the sums of a and b don't match, and people don't always need to leave in the order they arrive.
I realise it depends on several variables (or threshold parameters), but am just looking for suggestions about how best to approach the problem. I'm happy to use Pandas/Numpy/Scipy to do the job.
I've realised it's quite hard to explain. By eye, it's easy for me to match the partial sequences, such as 1,2,1,3. Not finding it so easy to work out a good algorithmic approach though.
I finally realised Python has the difflib library for just this kind of thing!
a = [1, 2, 1, 3, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 1, 2]
b = [1, 1, 2, 1, 3, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 3, 1, 2]
from difflib import SequenceMatcher
s = SequenceMatcher(None, a, b, autojunk=False)
matched_element_indices = []
for m in s.get_matching_blocks():
matched_element_indices += [(m.a+i,m.b+i) for i in range(m.size)]
It produces this:
In : matched_element_indices
Out: [(0, 1), (1, 2), (2, 3), (3, 4), (5, 6), (6, 7), (7, 8), (8, 9),
(10, 11), (11, 12), (12, 13), (13, 14), (14, 15), (15, 16), (16, 17)]