I have a list of N items, each of d dimensions (so essentially an N x d array). For each item x, I want to compute the outer product x.xT, which gives me an N x d x d array overall. How can I do this efficiently in NumPy? At the moment I am looping through the items and computing each outer product separately:
    for i in range(len(mu)):  # iterate over the N mean vectors
        current_mu = mu[i]  # vector of d elements
        distances = []
        for index in range(len(samples)):
            distance = np.asarray(current_mu - samples[index])[:, None]  # d x 1 column
            distances.append(distance * distance.T)  # broadcasts to d x d
Can I remove the second nested loop or is it required?
You can use numpy.einsum as follows:
    import numpy as np

    N, d = 10, 5
    mu = np.random.rand(N, d)
    r = np.einsum('ni,nj->nij', mu, mu)

    r.shape  # (10, 5, 5)
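To sanity-check the subscripts, each slice r[i] should equal the outer product of row i with itself; a quick comparison against np.outer (a minimal check, not part of the original answer) confirms this:

```python
import numpy as np

N, d = 10, 5
mu = np.random.rand(N, d)
r = np.einsum('ni,nj->nij', mu, mu)

# every r[i] is the d x d outer product mu[i] mu[i]^T
ok = all(np.allclose(r[i], np.outer(mu[i], mu[i])) for i in range(N))
print(ok)
```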
Comparing to a for-loop implementation:
    def for_loop(a):
        N, d = a.shape
        r = np.zeros((N, d, d))
        for i in range(N):
            r[i] = a[i][:, None] @ a[i][None, :]
        return r
# N>d case
N,d = 1000,500
mu = np.random.rand(N,d)
%timeit np.einsum('ni,nj->nij', mu, mu)
1.29 s ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit for_loop(mu)
2.36 s ± 45.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# N<d case
N,d = 100,1000
mu = np.random.rand(N,d)
%timeit np.einsum('ni,nj->nij', mu, mu)
521 ms ± 9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit for_loop(mu)
976 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In both cases einsum is almost 2x faster than the explicit loop.
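Another vectorized option worth benchmarking on your own machine (not part of the timings above) is broadcasting matmul: inserting singleton axes turns each row into a (d, 1) column and a (1, d) row, and `@` then forms one outer product per row. It produces the same result as the einsum call:

```python
import numpy as np

N, d = 100, 50
mu = np.random.rand(N, d)

# (N, d, 1) @ (N, 1, d) -> (N, d, d): one outer product per row
r_matmul = mu[:, :, None] @ mu[:, None, :]
r_einsum = np.einsum('ni,nj->nij', mu, mu)

print(np.allclose(r_matmul, r_einsum))  # True
```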