python python-2.7 numpy scipy cosine-similarity

Pairwise comparisons within a dataset

My data is 18 vectors, each with up to 200 numbers, though some have only 5 or some other count. They are organised as:

[2, 3, 35, 63, 64, 298, 523, 624, 625, 626, 823, 824]
[2, 752, 753, 808, 843]
[2, 752, 753, 843]
[2, 752, 753, 808, 843]
[3, 36, 37, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, ...]

I would like to find the pair that is the most similar in this group of lists. The numbers themselves are not important; they may as well be strings. A 2 in one list and a 3 in another list are not comparable.

I am checking whether the variables are the same. For example, the second list is exactly the same as the 4th list, but differs from list 3 by only 1 variable.

Additionally, it would be nice to find the most similar triplet, or the n lists that are the most similar, but pairwise is the first and most important task.

I hope I have laid out this problem clearly enough, but I am very happy to supply any more information that anyone might need!

I have a feeling it involves numpy or scipy norm/cosine calculations, but I can't quite work out how to do it, or whether this is the best method.

Any help would be greatly appreciated!

Solution

You can use itertools to generate your pairwise comparisons. If you just want the items which are shared between two lists you can use a set intersection. Using your example:

import itertools

a = [2, 3, 35, 63, 64, 298, 523, 624, 625, 626, 823, 824]
b = [2, 752, 753, 808, 843]
c = [2, 752, 753, 843]
d = [2, 752, 753, 808, 843]
e = [3, 36, 37, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112]

data = [a, b, c, d, e]

def number_same(a, b):
    # Find the items which are the same
    return set(a).intersection(set(b))

# combinations() yields each unordered pair once; range(len(data)) covers every list
for i, j in itertools.combinations(range(len(data)), 2):
    print("Indexes: ({}, {}) {}".format(i, j, len(number_same(data[i], data[j]))))

Output:

Indexes: (0, 1) 1
Indexes: (0, 2) 1
Indexes: (0, 3) 1
Indexes: (0, 4) 1
Indexes: (1, 2) 4
Indexes: (1, 3) 5
... etc

This gives the number of items shared between each pair of lists. You could use that count to decide which two lists are the best pair, for example by taking the pair with the highest count.
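One caveat with a raw shared-item count is that long lists tend to score higher simply because they contain more items. A possible refinement, which is my own suggestion rather than something the question asks for, is to divide the number of shared items by the number of distinct items across the lists (the Jaccard index). The sketch below assumes that scoring and uses two hypothetical helper names, jaccard and most_similar. It also handles groups of n lists, so the same function covers triplets. The sample data is just the first four lists from the question.

```python
import itertools

# Sample data: the first four lists from the question.
data = [
    [2, 3, 35, 63, 64, 298, 523, 624, 625, 626, 823, 824],
    [2, 752, 753, 808, 843],
    [2, 752, 753, 843],
    [2, 752, 753, 808, 843],
]

def jaccard(groups):
    """Items shared by all lists divided by distinct items across them (assumed scoring)."""
    sets = [set(g) for g in groups]
    union = set.union(*sets)
    if not union:
        return 0.0
    # float() keeps the division exact under Python 2 as well as Python 3
    return len(set.intersection(*sets)) / float(len(union))

def most_similar(data, n=2):
    """Return (score, indexes) for the best-scoring group of n lists."""
    return max(
        (jaccard([data[i] for i in combo]), combo)
        for combo in itertools.combinations(range(len(data)), n)
    )

best_pair = most_similar(data, 2)
best_triplet = most_similar(data, 3)
print(best_pair)     # (1.0, (1, 3))
print(best_triplet)  # (0.8, (1, 2, 3))
```

A score of 1.0 means the lists contain exactly the same items, which is why the identical second and fourth lists (indexes 1 and 3) come out on top. The number of groups grows quickly with n, but for 18 lists a brute-force search over pairs or triplets should still be cheap.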