Search code examples
pythonarraysnumpyvectorizationarray-broadcasting

How to vectorize an iterative operation on a 2D and 1D NumPy array in Python?


I'm trying to speed up my code to take full advantage of NumPy's vectorization. I've been able to vectorize all of the computations in the majority of my code (clr/vlr functions) which I believe is the optimum for NumPy. I don't believe these can be sped up anymore as np.einsum doesn't really apply. I can't figure out how to vectorize the rho function. In particular, it's this nested for-loop that is giving me trouble: 1 - (ratios[i,j]/ (variances[i] + variances[j])). I've tried using np.meshgrid to try and expand the variances but this didn't work. I've been trying to follow this tutorial on np.einsum but, still, it's a bit confusing and I'm still not sure if it applies to this situation as it involves more operations than dot products and matrix multiplication.

Does anyone know how to vectorize this function?

Extra, are there any suggestions to speed up my code for the previous 2 function?

import numpy as np
import pandas as pd

def clr(X):
    index = None
    labels = None
    if isinstance(X, pd.DataFrame):
        index = X.index
        labels = X.columns
        X = X.values
    X_log = np.log(X)
    geometric_mean = X_log.mean(axis=1)
    X_clr = X_log - geometric_mean[:,np.newaxis]
    if labels is not None:
        X_clr = pd.DataFrame(X_clr, index=index, columns=labels)
    return X_clr

def vlr(X):
    labels = None
    if isinstance(X, pd.DataFrame):
        labels = X.columns
        X = X.values
    n,m = X.shape
    X_log = np.log(X)
    covariance = np.cov(X_log.T)
    diagonal = np.diagonal(covariance)
    output = -2*covariance + diagonal[:,np.newaxis] + diagonal
    if labels is not None:
        output = pd.DataFrame(output, index=labels, columns=labels)
    return output

def rho(X):
    n, m = X.shape
    labels = None
    if isinstance(X, pd.DataFrame):
        labels = X.columns
        X = X.values

    ratios = vlr(X)
    X_clr = clr(X)
    variances = np.var(X_clr, axis=0)

    # How to vectorize?
    rhos = np.ones_like(ratios)
    for i in range(n):
        for j in range(i+1,m):
            coef = 1 - (ratios[i,j]/ (variances[i] + variances[j]))
            rhos[i,j] = rhos[j,i] = coef

    if labels is not None:
        rhos = pd.DataFrame(rhos, index=labels, columns=labels)
    return rhos

# Load data
X_iris = pd.read_csv("https://pastebin.com/raw/dR59vTD4", sep="\t", index_col=0)
# Calculation
print(rho(X_iris))
#               sepal_length  sepal_width  petal_length  petal_width
# sepal_length      1.000000     0.855012     -0.796224    -0.796770
# sepal_width       0.855012     1.000000     -0.670387    -0.964775
# petal_length     -0.796224    -0.670387      1.000000     0.493560
# petal_width      -0.796770    -0.964775      0.493560     1.000000

Solution

  • You can replace the loops:

    for i in range(n):
        for j in range(i+1,m):
            coef = 1 - (ratios[i,j]/ (variances[i] + variances[j]))
            rhos[i,j] = rhos[j,i] = coef
    

    with:

    rhos = 1 - ratios / np.add.outer(variances, variances)
    

    np.add.outer(variances, variances) takes the outer sum of variances. For example:

    np.add.outer(range(3), range(3))
    >>> array([[0, 1, 2],
               [1, 2, 3],
               [2, 3, 4]])
    

    Given the shape of your arrays and the loop math we can do ratios / variances[i] + variances[j] all at once using the code above.

    And just to confirm:

    # original loop
    for i in range(n):
        for j in range(i+1,m):
            coef = 1 - (ratios[i,j]/ (variances[i] + variances[j]))
            rhos[i,j] = rhos[j,i] = coef
    
    # array math replacement
    rhos2 = 1-ratios / np.add.outer(variances, variances)
    np.allclose(rhos, rhos2)
    >>> True