python, arrays, numpy, array-intersect

Find intersecting values in multiple numpy arrays


I have 100 large arrays with more than 250,000 elements each. I want to find values that these arrays have in common. I know that no value will be found in all 100 arrays, but a small number of values will be found in multiple arrays (I suspect 10-30%). I want to find which values occur with the highest frequency across these arrays. (Side point: the arrays have no duplicates.)

I know that I can loop through the arrays and eventually find them, but that will take a while. I also know about the np.intersect1d function, but that only gives values that are found in all of the arrays, whereas I'm looking for values that will only be in around 20 of the 100 arrays.

My best bet is to use the np.intersect1d function and loop through all possible combinations of the arrays, which would definitely take a while, but not as long as simply looping through all 250,000 x 100 values. Example:

array_1 = array([1.98, 2.33, 3.44, ..., 11.1])
array_2 = array([1.26, 1.49, 4.14, ..., 9.0])
array_3 = array([1.58, 2.33, 3.44, ..., 19.1])
array_4 = array([4.18, 2.03, 3.74, ..., 12.1])
.
.
.
array_100 = array([1.11, 2.13, 1.74, ..., 1.1])

No value appears in all 100 arrays. Is there a value that can be found in 30 different arrays?


Solution

  • You can either use np.unique with the return_counts keyword, or a vanilla Python Counter.

    The first option works if you can concatenate your arrays into a single 250k x 100 monolith, or even string them out one after the other:

    unq, counts = np.unique(monolith, return_counts=True)
    ind = np.argsort(counts)[::-1]
    unq = unq[ind]
    counts = counts[ind]
    

    This will leave you with an array containing all the unique values and an array with the frequency of each, both ordered from most to least frequent.

    If the arrays have to remain separate, use collections.Counter to accomplish the same task. In the following, I assume that you have a list containing your arrays. It would be very pointless to have a hundred individually named variables:

    from collections import Counter

    c = Counter()
    for arr in arrays:
        c.update(arr)

    Now c.most_common() will give you the elements and their counts, ordered from most to least common.
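
To make both options concrete, here is a minimal sketch on made-up data. The three small arrays, the `arrays` list and the `min_arrays` threshold are assumptions for illustration only, not the asker's real data. Because each array has no duplicates (as stated in the question), a value's total count equals the number of arrays it appears in. Both approaches compare floats exactly, so this presumably only helps if shared values are bit-for-bit identical.

```python
from collections import Counter

import numpy as np

# Hypothetical stand-in data: three small arrays, no duplicates within each.
arrays = [
    np.array([1.98, 2.33, 3.44, 11.1]),
    np.array([1.26, 2.33, 4.14, 9.0]),
    np.array([1.58, 2.33, 3.44, 19.1]),
]

# Option 1: string the arrays out one after the other and count with np.unique.
monolith = np.concatenate(arrays)
unq, counts = np.unique(monolith, return_counts=True)
order = np.argsort(counts)[::-1]  # indices from highest to lowest count
unq, counts = unq[order], counts[order]

# Keep only values that occur in at least `min_arrays` arrays (assumed threshold).
min_arrays = 2
frequent = unq[counts >= min_arrays]

# Option 2: keep the arrays separate and accumulate counts in a Counter.
c = Counter()
for arr in arrays:
    c.update(arr.tolist())  # tolist() yields plain Python floats as keys
top = c.most_common(2)  # the two most frequent values with their counts
```

For the real data, `min_arrays` could be set to something like 20 or 30 to match the frequencies the question is after.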