Tags: algorithm, sorting, optimization, similarity, nearest-neighbor

Efficient way to find array with the largest intersection to an input array


Say I have a large set of arrays (it can be up to millions in size), and I want to determine, preferably exactly although approximately is fine, which array in this set has the largest intersection with a given input array. What would be the most efficient way to do this? At the bottom I list some solutions that have crossed my mind, which reduce this to other problems, but I am not sure they are necessarily the best.

This set of arrays can be stored in any data structure, and the arrays can be sorted and stored in any way. The idea is to optimize query time here.

Example: say my set of arrays is (sorted in a radix-like manner for convenience; it can be sorted in any way we choose):

[('a', 'b'), ('a', 'e', 'f'), ('b', 'f', 'g'), ('b', 'j', 'z'), ('d', 'l', 'f'), ('x', 'y', 'z')]

and my input array is:

('a', 'f')

Then the respective intersections are:

[('a'), ('a', 'f'), ('f'), (), ('f'), ()]

So the output would be ('a', 'f'), having the largest intersection of size 2. As a bonus, it would be even better to have the largest K of these, so here, if K = 3, the output would be (in any order):

[('a', 'f'), ('f'), ('a')]

Some possible solutions I have thought of:

  • The size of my domain is restricted (it could be a-z, or the numbers 1-70, etc.), so I could potentially represent the arrays as binary strings, and the challenge then becomes finding the minimum Hamming distance, which I could tackle with something like locality-sensitive hashing. For example, ('a', 'f') could be represented as 10000100000000000000000000 (see the bitmask sketch after this list).
  • Also using the fact that the domain is restricted, I could build an inverted index in which each item of the domain points to the arrays in the set that contain it, and then intersect (at least some of) these lists for the items in the input array, although I feel this could be quite inefficient, especially if the intersection turns out to be small. This is similar to how a Google search works, although I don't know the full details of their algorithm (see the inverted-index sketch after this list).
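
To make the first idea concrete, here is a minimal sketch (my own illustration, not a worked-out solution) of the bitmask representation, assuming the domain is the lowercase letters a-z: each array becomes a 26-bit integer, the intersection size is the popcount of a bitwise AND, and the query is a plain linear scan over precomputed masks. The names to_mask and best_match are made up for the example.

def to_mask(arr):
    # Map an array over the domain 'a'..'z' to a 26-bit integer bitmask.
    mask = 0
    for ch in arr:
        mask |= 1 << (ord(ch) - ord('a'))
    return mask

def best_match(arrays, query):
    # Intersection size == popcount of the bitwise AND of the two masks.
    query_mask = to_mask(query)
    return max(arrays, key=lambda arr: bin(to_mask(arr) & query_mask).count('1'))

arrays = [('a', 'b'), ('a', 'e', 'f'), ('b', 'f', 'g'),
          ('b', 'j', 'z'), ('d', 'l', 'f'), ('x', 'y', 'z')]
print(best_match(arrays, ('a', 'f')))  # ('a', 'e', 'f'), whose intersection with the input is ('a', 'f')

This scan is exact and linear in the number of arrays per query; locality-sensitive hashing would only come into play if that linear scan is too slow.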
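
And a sketch of the second idea, an inverted index from each domain element to the arrays containing it: a query only touches the posting lists of its own elements, and counting hits over those lists directly yields the intersection sizes, which also gives the top K. The names build_index and top_k are again made up.

from collections import Counter, defaultdict

def build_index(arrays):
    # element -> indices of the arrays that contain it
    index = defaultdict(list)
    for i, arr in enumerate(arrays):
        for ch in set(arr):  # set() so duplicates within one array count once
            index[ch].append(i)
    return index

def top_k(arrays, index, query, k=1):
    # For every candidate array, count how many of the query's elements point to it.
    counts = Counter()
    for ch in set(query):
        counts.update(index.get(ch, ()))
    return [(arrays[i], size) for i, size in counts.most_common(k)]

arrays = [('a', 'b'), ('a', 'e', 'f'), ('b', 'f', 'g'),
          ('b', 'j', 'z'), ('d', 'l', 'f'), ('x', 'y', 'z')]
index = build_index(arrays)
print(top_k(arrays, index, ('a', 'f'), k=3))
# best hit: (('a', 'e', 'f'), 2); the two runners-up each share one element (ties in any order)

Arrays that share nothing with the query are never visited, so the per-query cost is proportional to the lengths of the touched posting lists rather than to the size of the whole collection.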

Thank you for any responses or pointers in the right direction!


Solution

  • Some questions beforehand that I couldn't ask via comment due to lacking reputation:

    1. All arrays are unique, but is every array itself a set (i.e. no duplicate elements)?
    2. If more than one array shares the largest intersection size, do you need to list them all?
    3. Could your input be longer than the longest given array?

    Iteration

    Without a hash set, I would sort the arrays by length and start with the longest ones; that way the shorter arrays at the end can be skipped entirely once the best intersection size found so far is larger than or equal to those arrays' sizes.
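
    A minimal sketch of that pruning (the function name is made up, and the arrays are treated as sets): since an array's intersection with the query can never exceed the array's own length, a scan over length-sorted arrays can stop as soon as the best size found so far matches or beats everything that remains.

    def best_by_length_pruning(arrays, query):
        query_set = set(query)
        best, best_size = None, -1
        # Longest arrays first, so the remaining candidates only get shorter.
        for arr in sorted(arrays, key=len, reverse=True):
            if len(arr) <= best_size:  # no remaining array can beat the current best
                break
            size = len(query_set.intersection(arr))
            if size > best_size:
                best, best_size = arr, size
        return best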

    If you also sort the arrays themselves you could make use of the Hamming distance, and you don't have to sort and convert all of the arrays up front; you can start with just a share of them. If you don't use the Hamming distance, keep in mind that when comparing your input with a sorted array, you only have to compare until you hit the first array element that is larger than your input's last element.

    a f

    a c k z // since k > f we don't need to compare f and z
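
    As a sketch of that early exit (again just an illustration), a merge-style walk over two already-sorted tuples that stops as soon as the candidate array moves past the input's last element:

    def sorted_intersection_size(query, arr):
        # Both tuples are assumed to be sorted.
        i = j = count = 0
        last = query[-1]
        while i < len(query) and j < len(arr):
            if arr[j] > last:  # nothing later in arr can match the query any more
                break
            if query[i] == arr[j]:
                count += 1
                i += 1
                j += 1
            elif query[i] < arr[j]:
                i += 1
            else:
                j += 1
        return count

    print(sorted_intersection_size(('a', 'f'), ('a', 'c', 'k', 'z')))  # 1 -- stops as soon as 'k' > 'f'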

    I would think this approach boils down to a complexity of O(n lg n): sorting the arrays by size is O(n lg n), calculating the sizes is n * O(1), and the inner radix sort is O(n). The comparison itself would be O(n lg n) (I'm not too sure about this one), so the total would be 2 * O(n lg n) + 2 * O(n) => O(n lg n).

    Tree

    Just a rough idea: you could radix-sort all arrays, transform them into their Hamming (bit string) representation, fill a tree with them, and traverse it until no further traversal could lead to a smaller distance. How efficient this is I have no idea.

    https://stackoverflow.com/a/6390606/9758920
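
    To make the rough idea a bit more concrete, here is one possible reading of it as a BK-tree over the bitmask (Hamming) representation; this is my interpretation rather than a worked-out structure from the link, and the names BKTree, add and nearest are made up. The triangle inequality lets whole subtrees be skipped once they cannot improve on the best distance found so far.

    def hamming(a, b):
        # Number of differing bits between two bitmasks.
        return bin(a ^ b).count('1')

    class BKTree:
        # Metric tree over bitmasks; every child is keyed by its distance to its parent.
        def __init__(self, mask):
            self.mask = mask
            self.children = {}  # distance -> subtree

        def add(self, mask):
            node = self
            while True:
                d = hamming(mask, node.mask)
                if d not in node.children:
                    node.children[d] = BKTree(mask)
                    return
                node = node.children[d]

        def nearest(self, query, best=None, best_d=None):
            # Return (mask, distance) of the stored mask closest to the query.
            d = hamming(query, self.mask)
            if best_d is None or d < best_d:
                best, best_d = self.mask, d
            for edge, child in self.children.items():
                # Everything under this edge is exactly `edge` away from self.mask, so by the
                # triangle inequality it is at least |edge - d| away from the query.
                if abs(edge - d) < best_d:
                    best, best_d = child.nearest(query, best, best_d)
            return best, best_d

    Keep in mind that the smallest Hamming distance only coincides with the largest intersection when the candidate arrays have similar lengths, because Hamming distance also penalizes elements that the candidate has and the query lacks.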