I'm trying to speed up my code to take full advantage of NumPy's vectorization. I've been able to vectorize all of the computations in the majority of my code (clr/vlr
functions) which I believe is the optimum for NumPy. I don't believe these can be sped up anymore as np.einsum
doesn't really apply. I can't figure out how to vectorize the rho
function. In particular, it's this nested for-loop that is giving me trouble: 1 - (ratios[i,j]/ (variances[i] + variances[j]))
. I've tried using np.meshgrid
to try and expand the variances but this didn't work. I've been trying to follow this tutorial on np.einsum
but, still, it's a bit confusing and I'm still not sure if it applies to this situation as it involves more operations than dot products and matrix multiplication.
Does anyone know how to vectorize this function?
Extra, are there any suggestions to speed up my code for the previous 2 function?
import numpy as np
import pandas as pd
def clr(X):
index = None
labels = None
if isinstance(X, pd.DataFrame):
index = X.index
labels = X.columns
X = X.values
X_log = np.log(X)
geometric_mean = X_log.mean(axis=1)
X_clr = X_log - geometric_mean[:,np.newaxis]
if labels is not None:
X_clr = pd.DataFrame(X_clr, index=index, columns=labels)
return X_clr
def vlr(X):
labels = None
if isinstance(X, pd.DataFrame):
labels = X.columns
X = X.values
n,m = X.shape
X_log = np.log(X)
covariance = np.cov(X_log.T)
diagonal = np.diagonal(covariance)
output = -2*covariance + diagonal[:,np.newaxis] + diagonal
if labels is not None:
output = pd.DataFrame(output, index=labels, columns=labels)
return output
def rho(X):
n, m = X.shape
labels = None
if isinstance(X, pd.DataFrame):
labels = X.columns
X = X.values
ratios = vlr(X)
X_clr = clr(X)
variances = np.var(X_clr, axis=0)
# How to vectorize?
rhos = np.ones_like(ratios)
for i in range(n):
for j in range(i+1,m):
coef = 1 - (ratios[i,j]/ (variances[i] + variances[j]))
rhos[i,j] = rhos[j,i] = coef
if labels is not None:
rhos = pd.DataFrame(rhos, index=labels, columns=labels)
return rhos
# Load data
X_iris = pd.read_csv("https://pastebin.com/raw/dR59vTD4", sep="\t", index_col=0)
# Calculation
print(rho(X_iris))
# sepal_length sepal_width petal_length petal_width
# sepal_length 1.000000 0.855012 -0.796224 -0.796770
# sepal_width 0.855012 1.000000 -0.670387 -0.964775
# petal_length -0.796224 -0.670387 1.000000 0.493560
# petal_width -0.796770 -0.964775 0.493560 1.000000
You can replace the loops:
for i in range(n):
for j in range(i+1,m):
coef = 1 - (ratios[i,j]/ (variances[i] + variances[j]))
rhos[i,j] = rhos[j,i] = coef
with:
rhos = 1 - ratios / np.add.outer(variances, variances)
np.add.outer(variances, variances)
takes the outer sum of variances
. For example:
np.add.outer(range(3), range(3))
>>> array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
Given the shape of your arrays and the loop math we can do ratios / variances[i] + variances[j]
all at once using the code above.
And just to confirm:
# original loop
for i in range(n):
for j in range(i+1,m):
coef = 1 - (ratios[i,j]/ (variances[i] + variances[j]))
rhos[i,j] = rhos[j,i] = coef
# array math replacement
rhos2 = 1-ratios / np.add.outer(variances, variances)
np.allclose(rhos, rhos2)
>>> True