Search code examples
pythoncorrelationnumpy-ndarray

How to calculate correlation between 1D numpy array and every column of a 2D numpy array


I have a 1D numpy array (y) and 2D numpy array (x) and I calculate correlation between y and every column in x as below:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(seed=42)

x = rng.random((3, 3))
y = rng.random(3)

for i in range(x.shape[1]):
    print( pearsonr(x[:, i], y)[0]  )

I was wondering how I can get the correlation values without For loop. Is there any way?


Solution

  • I propose these approaches, all of which lead to the same result as your proposed solution:

    • Approach 1: A solution similar to the one proposed by Lucas M. Uriarte, using numpy.corrcoef:

      np.corrcoef(y,x.T)[0][1:]
      
    • Approach 2: The function for calculating the correlation is rewritten using numpy functions:

      def corr_np(data1, data2):
          mean1 = data1.mean() 
          mean2 = data2.mean()
          std1 = data1.std()
          std2 = data2.std()
          corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2)
          return corr
      
      def paerson_np(x, y):
          return np.array([corr_np(x[:, i], y) for i in range(x.shape[1])])
      
    • Approach 3: The function for calculating the correlation is rewritten, using numba to speed up the calculations:

      @nb.njit()
      def corr_nb(data1, data2):
          M = data1.size
          sum1 = 0.
          sum2 = 0.
          for i in range(M):
              sum1 += data1[i]
              sum2 += data2[i]
          mean1 = sum1 / M
          mean2 = sum2 / M
      
          var_sum1 = 0.
          var_sum2 = 0.
          cross_sum = 0.
          for i in range(M):
              var_sum1 += (data1[i] - mean1) ** 2
              var_sum2 += (data2[i] - mean2) ** 2
              cross_sum += (data1[i] * data2[i])
      
         std1 = (var_sum1 / M) ** .5
         std2 = (var_sum2 / M) ** .5
         cross_mean = cross_sum / M
      
         return (cross_mean - mean1 * mean2) / (std1 * std2)
      
      @nb.njit()
      def paerson_nb(x, y):
          return np.array([corr_nb(x[:, i], y) for i in range(x.shape[1])])
      

    Execution time comparison

    I experimented to see which solution was more efficient, comparing the 3 approaches I listed above and your solution (which I will call approach 0). The instances for the experiments have the following structure:

    import numpy as np
    import numba as nb
    from scipy.stats import pearsonr
    
    rng = np.random.default_rng(seed=42)
    n = 20000
    x = rng.random((n, n))
    y = rng.random(n)
    

    Results:

    • Approch 0 (your solution):

      %timeit approach0(x, y) :-> 15.6 s ± 200 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
      
    • Approach 1:

      %timeit np.corrcoef(y,x.T)[0][1:] :-> 37.4 s ± 3.68 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
      
    • Approach 2:

      %timeit paerson_np(x, y) :-> 19.1 s ± 351 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
      
    • Approach 3:

      %timeit paerson_nb(x, y) :-> 7.81 s ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
      

    The solution with numba(approch 3), is about 2 times faster than your solution(approach 0) and the solution with numpy(approach 2). The solution with numpy.corrcoef is clearly the slowest: about 2 times slower than aprroaches 0 and 2, and even more than 5 times slower than the solution with numba.