Say I have a large set of arrays (the set can be up to millions in size), and I want to determine (preferably exactly, although approximately is fine) the array in this set with the largest intersection with the input. What would be the most efficient way to do this? I will list some solutions that have crossed my mind at the bottom, which reduce this into another problem, but I am not sure if they are necessarily the best.
This set of arrays can be stored in any data structure, and the arrays can be sorted and stored in any way. The idea is to optimize query time here.
Example: say my set of arrays is (sorted in a radix-like manner for convenience; it can be sorted in any way chosen):
[('a', 'b'), ('a', 'e', 'f'), ('b', 'f', 'g'), ('b', 'j', 'z'), ('d', 'l', 'f'), ('x', 'y', 'z')]
and my input array is:
('a', 'f')
Then the respective intersections are:
[('a'), ('a', 'f'), ('f'), (), ('f'), ()]
So the output would be ('a', 'f'), having the largest intersection, of size 2. As a bonus, it would be even better to have the largest K of these, so here, if K = 3, the output would be (in any order):
[('a', 'f'), ('f'), ('a')]
Some possible solutions I have thought of: each array could be represented as a bit vector over the alphabet of possible elements, so ('a', 'f') could be represented as 10000100000000000000000000, and the problem then reduces to finding the stored bit vector whose AND with the input has the most set bits.
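If you go with that bit-vector reduction, a minimal sketch of the query-time scan might look like the following (Python; I'm assuming the elements are lowercase letters so a 26-bit mask is enough, and `to_mask`/`best_match` are just my own names):

```python
# Sketch only: assumes the elements are lowercase letters 'a'-'z', so each
# array fits in a 26-bit mask; all names here are illustrative.

def to_mask(arr):
    """Encode an array of letters as a bit mask (bit 0 = 'a', bit 1 = 'b', ...)."""
    mask = 0
    for ch in arr:
        mask |= 1 << (ord(ch) - ord('a'))
    return mask

def best_match(stored, query):
    """Return (array, intersection size) for the stored array whose
    intersection with the query is largest, via popcount of the AND."""
    query_mask = to_mask(query)
    best = (None, -1)
    for arr, mask in stored:
        size = bin(mask & query_mask).count("1")   # popcount of the AND
        if size > best[1]:
            best = (arr, size)
    return best

arrays = [('a', 'b'), ('a', 'e', 'f'), ('b', 'f', 'g'),
          ('b', 'j', 'z'), ('d', 'l', 'f'), ('x', 'y', 'z')]
stored = [(arr, to_mask(arr)) for arr in arrays]    # masks precomputed once
print(best_match(stored, ('a', 'f')))               # (('a', 'e', 'f'), 2)
```

With the masks precomputed, each query is a linear scan doing one AND plus one popcount per stored array.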
Thank you for any responses or pointers in the right direction!
Some questions beforehand that I couldn't ask via comment due to lacking reputation:
Without a hash set I would sort the arrays by length and start with the longest ones, so that the shorter arrays at the end can be skipped entirely once the best intersection size found is larger than or equal to their sizes.
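A sketch of that pruning (Python; the set intersection here is only a placeholder for whichever per-pair comparison you prefer, and the function name is mine):

```python
# Sketch of the length-pruning idea: check the longest arrays first and stop
# once no remaining (shorter) array could beat the best intersection so far.

def best_by_length(stored_arrays, query):
    query_set = set(query)
    ordered = sorted(stored_arrays, key=len, reverse=True)  # ideally done once, offline
    best_arr, best_size = None, -1
    for arr in ordered:
        if len(arr) <= best_size:
            break   # every array from here on is too short to do better
        size = len(query_set.intersection(arr))
        if size > best_size:
            best_arr, best_size = arr, size
    return best_arr, best_size

print(best_by_length([('a', 'b'), ('a', 'e', 'f'), ('x', 'y', 'z')], ('a', 'f')))
# (('a', 'e', 'f'), 2)
```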
If you also sort the arrays themselves, you could make use of the Hamming distance, and you don't have to sort and convert all arrays at once; you can start with only a share of them. If you don't use the Hamming distance, keep in mind that when you compare your input against a (longer) sorted array, you only have to compare until you hit the first element of that array that is larger than your input's last element:
a f
a c k z // since k > f we don't need to compare f and z
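Concretely, that early exit during one comparison of two sorted arrays could look like this (Python, my own sketch of the idea above):

```python
def intersection_size_sorted(query, arr):
    """Count common elements of two sorted sequences, stopping as soon as the
    candidate's current element exceeds the query's last element."""
    i = j = count = 0
    last_query_elem = query[-1]
    while i < len(query) and j < len(arr):
        if arr[j] > last_query_elem:
            break                       # e.g. 'k' > 'f': nothing later can match
        if query[i] == arr[j]:
            count += 1
            i += 1
            j += 1
        elif query[i] < arr[j]:
            i += 1
        else:
            j += 1
    return count

print(intersection_size_sorted(('a', 'f'), ('a', 'c', 'k', 'z')))  # 1
```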
I would think this boils down to a complexity of O(n lg n): sorting the arrays by size is O(n lg n), calculating the sizes is n * O(1), and the inner radix sorting is O(n). The comparison itself would be O(n lg n) (not too sure about this one), so the total would be 2 * O(n lg n) + 2 * O(n) => O(n lg n).
Just a rough idea: you could sort all arrays with radix sort, transform them into bit strings so you can use the Hamming distance, fill a tree with them, and traverse it until no further traversal could lead to a smaller distance. How efficient this is, I have no idea.
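One concrete structure that matches this description is a BK-tree keyed on the Hamming distance between bit masks; the sketch below is purely my own illustration, and note that minimizing Hamming distance is not quite the same as maximizing intersection size, since it also penalizes elements the candidate has but the query lacks.

```python
# Illustrative only: a BK-tree over Hamming distance between bit masks
# (e.g. the 26-bit masks from the bit-vector idea in the question).

def hamming(a, b):
    """Number of differing bits between two masks."""
    return bin(a ^ b).count("1")

class BKNode:
    def __init__(self, mask, payload):
        self.mask = mask
        self.payload = payload          # e.g. the original array
        self.children = {}              # edge distance -> child node

    def insert(self, mask, payload):
        d = hamming(mask, self.mask)
        if d in self.children:
            self.children[d].insert(mask, payload)
        else:
            self.children[d] = BKNode(mask, payload)

    def nearest(self, mask, best=None):
        """Return (distance, payload) of the closest stored mask, pruning
        subtrees the triangle inequality rules out."""
        d = hamming(mask, self.mask)
        if best is None or d < best[0]:
            best = (d, self.payload)
        for edge, child in self.children.items():
            # A strictly closer element can only live under edges with
            # |edge - d| < current best distance.
            if abs(edge - d) < best[0]:
                best = child.nearest(mask, best)
        return best
```

Building it means inserting every stored mask once under a root node, and a query is then root.nearest(query_mask); how much the pruning actually saves on this kind of data is exactly the open question.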