python numpy vectorization

How would you vectorize a fraction of sums of matrices (Expectation Maximization) in numpy?


I am trying to vectorize the following Expectation-Maximization / clustering equation for a 2-dimensional Gaussian distribution using numpy. I have a naive approach that I will include at the end of my question:

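Expectation Maximization Covariance Matrix update (the same computation as the naive code below):

$$\Sigma_k = \frac{\sum_{n} z_{nk}\,(y_n - \mu_k)(y_n - \mu_k)^{T}}{\sum_{n} z_{nk}}$$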

For context, the variables and dimensions are defined as follows:

  • n = data point index (e.g. 1-1000)
  • k = cluster index (e.g. 1-3)
  • z = a conditional probability that datapoint n is in cluster k (in [0,1])
  • y = value of datapoint n (shape (2,))
  • mu = current estimated multi-variate mean of cluster k (shape (2,))

The numerator is a sum of (2, 2) matrices and the denominator is a scalar, so the final value is a (2, 2) covariance matrix estimate. This must be done for each value of "k" (1, 2, 3).

I've achieved a vectorized approach for other values by defining the following numpy arrays:

  • Z = est. probability values for each datapoint, cluster (shape (n, k))
  • X = multivariate data matrix (shape (n, 2))
  • MU = est. cluster means (shape (k, 2); named mu in the code below)

My naive code is as follows:

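import numpy as np

K, D = mu.shape                    # number of clusters, data dimension
sigma = np.zeros((K, D, D))        # one (D, D) covariance estimate per cluster
for kk in range(K):
    numsum = np.zeros((D, D))
    for ii in range(X.shape[0]):
        diff = (X[ii, :] - mu[kk, :]).reshape(-1, 1)   # column vector, shape (D, 1)
        numsum += Z[ii, kk] * (diff @ diff.T)          # weighted outer product
    sigma[kk] = numsum / np.sum(Z[:, kk])              # divide by the cluster's total weight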

Long story long - is there any better way to do this?


Solution

  • You can use np.einsum:

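    d = X - mu[:, None]                # shape (k, n, 2): differences to each cluster mean
    w = Z / Z.sum(0, keepdims=True)    # shape (n, k): weights normalized per cluster
    sigma = np.einsum('ijk,ijm,ji->imk', d, d, w)   # shape (k, 2, 2)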
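
    As a quick sanity check, the sketch below compares the einsum result against the double loop from the question on random data. The sizes (N, K, D), the random inputs, and the names sigma_loop, sigma_einsum and sigma_matmul are assumptions made for illustration. The batched-matmul variant at the end is not part of the answer above; it is just one other way to write the same sum.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D = 1000, 3, 2                 # assumed sizes, matching the ranges in the question

X = rng.normal(size=(N, D))          # synthetic data points
mu = rng.normal(size=(K, D))         # synthetic cluster means
Z = rng.random(size=(N, K))
Z /= Z.sum(axis=1, keepdims=True)    # each row sums to 1, like cluster probabilities

# Reference: the double loop from the question
sigma_loop = np.zeros((K, D, D))
for kk in range(K):
    for ii in range(N):
        diff = (X[ii] - mu[kk]).reshape(-1, 1)
        sigma_loop[kk] += Z[ii, kk] * (diff @ diff.T)
    sigma_loop[kk] /= Z[:, kk].sum()

# Vectorized: the einsum from the answer
d = X - mu[:, None]                  # shape (K, N, D)
W = Z / Z.sum(0, keepdims=True)      # shape (N, K), each column sums to 1
sigma_einsum = np.einsum('ijk,ijm,ji->imk', d, d, W)

# Alternative (an assumption, not from the answer): weight d, then batched matmul
sigma_matmul = (d * W.T[:, :, None]).transpose(0, 2, 1) @ d

print(np.allclose(sigma_loop, sigma_einsum))
print(np.allclose(sigma_loop, sigma_matmul))
```

    Because each weighted outer product is symmetric, the output index order 'imk' gives the same matrices as 'ikm' would.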