Why is the performance of the same loop algorithm differs

Right now I am doing assignments from cs 231 n , and I wanted to calculate euclidean distance between points:

dists[i, j]=0
for k in range(3072):
    dists[i, j]+=math.pow((X[i,k] - self.X_train[j,k]),2)
dists[i, j] = math.sqrt(dists[i,j])

however, this code is very slow. Then I tried

 dists[i,j] = dist = np.linalg.norm(X[i,:] - self.X_train[j,:])

which is way faster. The question is why? Doesn't np.linalg.norm also loop through all coordinates of all points, subtracts, puts into power, sums and squares them? Could someone give me a detailed answer : is it because of how does np.linalg.norm access elements or there is other reason?

Solution

NumPy can do the entire calculation in one fell swoop in optimized, accelerated (e.g. SSE, AVX, what-have-you) C code.

The original code does all of its work in Python (aside from the math functions, which are implemented in C, but also take time roundtripping Python objects), which just, well, is slower.