I am trying to find the best way to compare large sets of numerical sequences to other large sets, in order to rank them against each other. Maybe the following toy example clarifies the issue, where lists a, b, and c represent shingles of size 3 in a time series.
a = [(1,2,3),(2,3,4),(3,4,5)]
b = [(1,2,3),(2,3,4),(3,4,7),(4,7,8)]
c = [(1,2,3),(2,3,5)]
set_a, set_b, set_c = set(a), set(b), set(c)
jaccard_ab = float(len(set_a.intersection(set_b)))/float(len(set_a.union(set_b)))
jaccard_bc = float(len(set_b.intersection(set_c)))/float(len(set_b.union(set_c)))
jaccard_ac = float(len(set_a.intersection(set_c)))/float(len(set_a.union(set_c)))
The similarity among these sets is:
jaccard_ab, jaccard_bc, jaccard_ac
(0.4, 0.2, 0.25)
So in this example, we can see that sets a and b are the most similar, with a score of 0.4.
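For reference, here is a minimal sketch of how shingles like these can be built from a raw series with a sliding window (series and shingle_size are just illustrative names, not part of my actual code):

series = [1, 2, 3, 4, 5]
shingle_size = 3

# slide a window of length shingle_size across the series
shingles = [tuple(series[i:i + shingle_size])
            for i in range(len(series) - shingle_size + 1)]
# shingles == [(1, 2, 3), (2, 3, 4), (3, 4, 5)], i.e. list a above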
I am having a design problem:

1) Since each set will be composed of ~1000 shingles, do I gain speed by transforming every shingle into a unique hash and then comparing hashes?

2) Initially, I have over 10,000 sets to compare, so I think I am much better off storing the shingles (or hashes, depending on the answer to 1) in a database or pickling them. Is this a good approach?

3) As a new set is added to my workflow, I need to rank it against all existing sets and display, let's say, the top 10 most similar. Is there a better approach than the one in the toy example?
1) Members of a set have to be hashable, so Python is already computing hashes for you. Storing sets of hashes of the items would be duplicated effort, so there's no need to do that.
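To make that concrete, a quick sketch reusing list a from the question:

a = [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
set_a = set(a)                     # Python hashes each tuple internally on insertion and lookup
hashed_a = {hash(t) for t in a}    # redundant: the same hashing work, and you lose the original shingles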
2) The complexity of set intersection and union is approximately linear. The Jaccard index isn't computationally expensive, and 10,000 sets isn't that many (about 50 million[1] pairwise computations). It will probably take an hour or so to compute your initial results, but it won't take days.
3) Once you have all of your combinations, ranking another set against your existing results means doing only 10,000 more comparisons. I can't think of a simpler way than that.
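As a sketch of that step (existing, new_set, and the ids are illustrative; in practice existing would hold your ~10,000 stored sets):

import heapq

def jaccard(A, B):
    return len(A & B) / len(A | B)

# existing maps an id to its set of shingles
existing = {
    'a': {(1, 2, 3), (2, 3, 4), (3, 4, 5)},
    'b': {(1, 2, 3), (2, 3, 4), (3, 4, 7), (4, 7, 8)},
    'c': {(1, 2, 3), (2, 3, 5)},
}
new_set = {(1, 2, 3), (2, 3, 4)}   # the incoming set of shingles

# score the new set against every stored set and keep the 10 best
top_10 = heapq.nlargest(10, ((jaccard(new_set, s), set_id)
                             for set_id, s in existing.items()))
for score, set_id in top_10:
    print(set_id, score)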
I'd say just do it.
If you want to go faster, then you should be able to use a multiprocessing approach fairly easily with this dataset. (Each computation is independent of the other ones, so they can all run in parallel).
Here's an example adapted from the concurrent.futures examples (Python 3).
import concurrent.futures
from itertools import combinations

data = [
    {(1, 2, 3), (2, 3, 4), (3, 4, 5), ...},
    {(12, 13, 14), (15, 16, 17), ...},
    ...
]

def jaccard(A, B):
    return len(A & B) / len(A | B)

with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
    # submit one job per pair of sets; map each future back to its pair
    futures = {executor.submit(jaccard, *sets): sets
               for sets in combinations(data, 2)}

    for future in concurrent.futures.as_completed(futures):
        jaccard_index = future.result()
        print(jaccard_index)  # write output to a file or database here
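Because the futures dict maps each future back to the pair of sets it was submitted with, you can look up futures[future] inside the loop if you need to record which pair each score belongs to.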
[1]:
>>> from itertools import combinations
>>> print(sum(1 for i in combinations(range(10000), 2)))
49995000