Search code examples
pythonnumpyhistogramsparse-matrix

Efficient Histogram of Differences for sparse Data


I want to compute a histogram of the differences between all the elements in one array A with all the elements in another array B.

So I want to have a histogram of the following data:

Delta1 = A1-B1
Delta2 = A1-B2
Delta3 = A1-B3
...
DeltaN = A2-B1
DeltaN+1 = A2-B2
DeltaN+2 = A2-B3 
...

The point of this calculation is to show that these data has a correlation, even though not every data point has a "partner" in the other array and the correlation is rather noisy in practice.

The problem is that these files are in practice very large, several GB and all entries of the vectors are 64 bit integer numbers with very large differences. It seems unfeasible to me to convert these data to binary arrays in order to be able to use correlation functions and fourier transforms to compute this.

Here is a small example to give a better taste of what I'm looking at. This implementation with numpy's searchsorted in a for loop is rather slow.

import numpy as np
import matplotlib.pyplot as plt

timetagsA = [668656283,974986989,1294941174,1364697327,\
1478796061,1525549542,1715828978,2080480431,2175456303,2921498771,3671218524,\
4186901001,4444689281,5087334517,5467644990,5836391057,6249837363,6368090967,8344821453,\
8933832044,9731229532]


timetagsB = [13455,1294941188,1715828990,2921498781,5087334530,5087334733,6368090978,9731229545,9731229800,9731249954]

max_delta_t = 500
nbins = 10000 
histo=np.zeros((nbins,2), dtype = float) 
histo[:,0]=np.arange(0,nbins)   

for i in range(0,int(len(timetagsA))):
    delta_t = 0
    j = np.searchsorted(timetagsB,timetagsA[i]) 
    while (np.round(delta_t) < max_delta_t and j<len(timetagsB)):
        delta_t = timetagsB[j] - timetagsA[i] 

        if(delta_t<max_delta_t):
            histo[int(delta_t),1]+=1 

        j = j+1

plt.plot(histo[0:50,1])
plt.show()

It would be great if someone could help me to find a faster way to compute this. Thanks in advance!


Solution

  • EDIT

    The below solution is supposing that your data is so huge that you can not use np.substract with np.outer and then slice the value you want to keep:

    arr_diff = np.subtract.outer(arrB, arrA)
    print (arr_diff[(0<arr_diff ) &(arr_diff <max_delta_t)])
    # array([ 14,  12,  10,  13, 216,  11,  13, 268], dtype=int64)
    

    with your example data it works but not with too huge data set

    ORIGINAL SOLUTION

    Let's first suppose your max_delta_t is smaller than the difference between two successive values in timetagsB for an easy way of doing it (then we can try to generalize it).

    #create the array instead of list
    arrA = np.array(timetagsA)
    arrB = np.array(timetagsB)
    max_delta_t = np.diff(arrB).min() - 1 #here it's 202 just for the explanation
    

    You can use np.searchsorted in a vectorize way:

    # create the array of search 
    arr_search = np.searchsorted(arrB, arrA) # the position of each element of arrA in arrB
    print (arr_search)
    # array([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 6, 6, 6, 6, 7, 7, 7],dtype=int64)
    

    You can calculate the difference between the element of arrB corresponding to each element of arrA by slicing arrB with arr_search

    # calculate the difference
    arr_diff = arrB[arr_search] - arrA
    print (arr_diff[arr_diff<max_delta_t]) # finc the one smaller than max_delta_t
    # array([14, 12, 10, 13, 11, 13], dtype=int64)
    

    So what you are looking for is then calculated by np.bincount

    arr_bins = np.bincount(arr_diff[arr_diff<max_delta_t])
    #to make it look like histo but not especially necessary
    histo = np.array([range(len(arr_bins)),arr_bins]).T
    

    Now the problem is that, there is some values of difference between arrA and arrB that could not be obtained with this method, when max_delta_t is bigger than two successive values in arrB. Here is one way, naybe not the most efficient depending on the values of your data. For any value of max_delta_t

    #need an array with the number of elements in arrB for each element of arrA 
    # within a max_delta_t range
    arr_diff_search = np.searchsorted(arrB, arrA + max_delta_t)- np.searchsorted(arrB, arrA)
    #do a loop to calculate all the values you are interested in
    list_arr = []
    for i in range(arr_diff_search.max()+1):
        arr_diff = arrB[(arr_search+i)%len(arrB)][(arr_diff_search>=i)] - arrA[(arr_diff_search>=i)]
        list_arr.append(arr_diff[(0<arr_diff)&(arr_diff<max_delta_t)])
    

    Now you can np.concatenate the list_arr and use np.bincount such as:

    arr_bins = np.bincount(np.concatenate(list_arr))
    histo = np.array([range(len(arr_bins)),arr_bins]).T