python, numpy, machine-learning, vectorization

I am trying to vectorize an operation using numpy.vectorize, but I can't seem to make it work.


np.corrcoef takes two arguments and they must have the same dimensions. In my case datax is an n-by-n array and datay is an n-by-1 array. I want to vectorize this operation so I don't have to use loops to find my results. I think np.vectorize is my answer, but nothing I have tried gives me a result. Here is my latest attempt:

    def f(datax, datay):
        return np.corrcoef(data, datay)

    result = np.vectorize(f, dtype=np.ndarray)

Solution

  • np.vectorize() is not really for performance. Most numpy operations are vectorized anyway.

    I assume you're trying to calculate correlations to y columnwise.

    Let's test it out (I used a small dataframe of roughly 400 rows). A naive for loop is indeed relatively slow:

    %%timeit
    [np.corrcoef(X_train[:,i], y_train)[0,1] for i in range(10)]
    459 µs ± 1.45 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    

    A 'proper' vectorized version should do something like:

    def f(datax, datay):
        return np.corrcoef(datax, datay, rowvar=False)
    result = np.vectorize(f, signature="(m,n),(m)->(k,k)")
    
    %%timeit
    result(X_train, y_train)[-1,0:X_train[0].size]
    121 µs ± 84.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    

    Much better! But alas, np.corrcoef() is already better vectorized:

    %%timeit
    np.corrcoef(X_train, y_train, rowvar=False)[-1,0:X_train[0].size]
    64.7 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    

    That's basically twice as fast.
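    To see why that slice works: with rowvar=False, np.corrcoef treats each column as a variable, so passing y alongside X yields an (n+1)-by-(n+1) matrix whose last row (and last column) holds the correlations of y with every column of X. A minimal sketch with synthetic data (the array names and shapes here are my own, for illustration):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((400, 10))  # 400 samples, 10 feature columns
    y = rng.standard_normal(400)

    # Full (11, 11) correlation matrix: 10 columns of X plus y as the 11th variable
    full = np.corrcoef(X, y, rowvar=False)

    # Last row, first 10 entries = corr(X[:, i], y) for each column i
    cols_vs_y = full[-1, 0:X[0].size]

    # Matches the naive per-column loop
    loop = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(10)])
    print(np.allclose(cols_vs_y, loop))
    ```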

    If you really wish to speed it up, however, einsum comes to mind (adapted from this question):

    def columnwisecorrcoef(O, P):
        # Pearson correlation between each column of O and the vector P
        n = np.double(P.size)
        DO = O - (np.einsum('ij->j', O) / n)   # center each column of O
        PO = P - (np.einsum('i->', P) / n)     # center P
        tmp = np.einsum('ij,ij->j', DO, DO)    # sum of squares per column
        tmp *= np.einsum('i,i->', PO, PO)      # times sum of squares of P
        return np.dot(PO, DO) / np.sqrt(tmp)   # covariances / products of norms
        
    %%timeit
    columnwisecorrcoef(X_train, y_train)
    24.8 µs ± 45.1 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
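
    As a quick sanity check (with synthetic data of my own; the shapes mirror the X_train/y_train setup above), the einsum version agrees with np.corrcoef:

    ```python
    import numpy as np

    def columnwisecorrcoef(O, P):
        # Same einsum-based column-wise correlation as above
        n = np.double(P.size)
        DO = O - (np.einsum('ij->j', O) / n)
        PO = P - (np.einsum('i->', P) / n)
        tmp = np.einsum('ij,ij->j', DO, DO)
        tmp *= np.einsum('i,i->', PO, PO)
        return np.dot(PO, DO) / np.sqrt(tmp)

    rng = np.random.default_rng(1)
    X = rng.standard_normal((400, 10))
    y = rng.standard_normal(400)

    out = columnwisecorrcoef(X, y)
    expected = np.corrcoef(X, y, rowvar=False)[-1, 0:X[0].size]
    print(np.allclose(out, expected))
    ```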