I don't know if I am taking it too far, but I would like to remove the for loop in the below code:
# self.assignments = (N, )
# self.points = (N, D)
# centers = (K, D)
centers = np.zeros((self.K, self.points.shape[1]))
_, counts = np.unique(self.assignments, return_counts=True)
for point, assignment in zip(self.points, self.assignments):
    # The lines below assume D = 2, which is erroneous.
    # It also implies a nested loop over D.
    centers[assignment][0] += point[0] / counts[assignment]
    centers[assignment][1] += point[1] / counts[assignment]
return centers
This code recalculates the centers of a K-Means clustering algorithm based on the assignments. I am struggling to remove the for-loop, but my intuition says it can be removed. This got me wondering: are there any rules/heuristics one can think about while attempting to remove for-loops through techniques like vectorization?
My expectation is an answer to whether removal of the for-loop from the above code is possible, and some rules or heuristics one can apply to optimization through for-loop removal in general.
Sample Input & Output:
points = np.array([[-1.431, 0.878],
                   [-0.920, 0.488]])
assignments = np.array([0, 1])
centers = np.array([[-0.299, 0.224]])
output = [[-1.431  0.878]
          [-0.920  0.488]]
So, I take it that assignments is an array of N integers in the interval [0, K), assigning each point to a bin. And that centers is the mean of each bin. Right.
Here is a minimal reproducible example of the problem as I understand it
N=5
K=3
np.random.seed(0)
assignments = np.random.randint(0,K,(N,))
points = np.random.rand(N,2)
centers = np.zeros((K,2))
_, counts = np.unique(assignments, return_counts=True)
for point, assignment in zip(points, assignments):
    centers[assignment][0] += point[0] / counts[assignment]
    centers[assignment][1] += point[1] / counts[assignment]
Result being
array([[0.43102341, 0.55485167],
[0.45758964, 0.33427908],
[0. , 0. ]])
Which seems consistent, since points and assignments are
#points
array([[0.38438171, 0.29753461],
[0.05671298, 0.27265629],
[0.47766512, 0.81216873],
[0.47997717, 0.3927848 ],
[0.83607876, 0.33739616]])
#assignments
array([0, 1, 0, 1, 1])
So, first center is sum of points 0 and 2, divided by 2, and second sum of 1,3 and 4, divided by 3.
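We can verify that by hand, using the points and assignments from the example above:
# Bin 0 holds points 0 and 2; bin 1 holds points 1, 3 and 4.
print(points[[0, 2]].sum(axis=0) / 2)     # [0.43102341 0.55485167]
print(points[[1, 3, 4]].sum(axis=0) / 3)  # [0.45758964 0.33427908]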
Before we get to vectorization, first note that we were lucky that it even worked like this. Just change the seed to 13, and the result is different:
IndexError: index 2 is out of bounds for axis 0 with size 2
The reason is easily understandable looking at assignments and counts:
# assignments
array([2, 0, 2, 0, 2])
# counts
array([2, 3])
So, your code relies on all possible values existing at least once. Since we don't know where assignments comes from, maybe that is true. But if it is false, then your code risks failing when some assignments do not happen.
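To make the failure concrete, here is a minimal sketch, hardcoding the seed-13 assignments shown above:
import numpy as np
assignments = np.array([2, 0, 2, 0, 2])  # value 1 never occurs
_, counts = np.unique(assignments, return_counts=True)
print(counts)  # [2 3] -- only two slots, for values 0 and 2
counts[2]      # raises IndexError: index 2 is out of bounds for axis 0 with size 2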
So, my first suggestion would be to change your way to create counts
counts=np.zeros((K,), dtype=int)
v,c=np.unique(assignments, return_counts=True)
counts[v]=c
Which, in addition to correcting an error in your code, gives you a first hint of how vectorization can be done on array indexing: v here is an array of ints.
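For instance, a small sketch of that corrected counting on the seed-13 assignments:
import numpy as np
K = 3
assignments = np.array([2, 0, 2, 0, 2])
counts = np.zeros((K,), dtype=int)
v, c = np.unique(assignments, return_counts=True)
counts[v] = c  # v = [0, 2] indexes counts with an array of ints
print(counts)  # [2 0 3] -- the empty bin 1 now has its own slot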
So adding the improvement suggested by hpaulj (or a variant of it), our code is now
N=5
K=3
np.random.seed(0)
assignments = np.random.randint(0,K,(N,))
points = np.random.rand(N,2)
centers = np.zeros((K,2))
counts=np.zeros((K,), dtype=int)
v,c=np.unique(assignments, return_counts=True)
counts[v]=c
for point, assignment in zip(points, assignments):
    centers[assignment] += point / counts[assignment]
Giving of course the exact same result
It would be tempting now to continue that vectorization by doing
centers[assignments] += points / counts[assignments, None]
But this doesn't work. It gives
array([[0.23883256, 0.40608436],
[0.27869292, 0.11246539],
[0. , 0. ]])
Which, as we can easily check, is, for each bin, counting only the last point (0.23883256 is 0.47766512/2, and 0.27869292 is 0.83607876/3).
Because that is like computing centers[assignments] + points / counts[assignments, None] and assigning those values to centers[assignments]. So, exactly as centers[[0,1,0]] = [1,2,3] means that centers[0] is now 3 and centers[1] is 2 (and the value 1, that was briefly assigned to centers[0], has been overwritten), all but the last value of each bin are overwritten.
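A minimal demonstration of that overwriting behavior:
import numpy as np
centers = np.zeros(3)
centers[[0, 1, 0]] = [1, 2, 3]
print(centers)  # [3. 2. 0.] -- the 1 written to centers[0] was overwritten by the 3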
One method, still using a for loop, but a smaller one (well, I take it that K is supposed to be smaller than N), would be to loop over the bins:
for k in range(K):
    mask = assignments == k
    centers[k] = points[mask].sum(axis=0) / counts[k]
But if one bin is empty, that leads to division by 0. Which is easily solved by iterating only over existing values (we have already computed this array, called v, earlier):
for k in v:
    mask = assignments == k
    centers[k] = points[mask].sum(axis=0) / counts[k]
Which still gives the same result as earlier, with fewer iterations (v cannot have more than min(N, K) elements, and I expect it is even way smaller).
Since hpaulj mentioned bincount, we can indeed use it. But not in one line.
bincount is just what you did with unique. I didn't mention it before, because I needed the values (my v vector). But you didn't: you ignored the values output of unique, keeping only counts.
In that case, you could have used bincount: counts = np.bincount(assignments).
That is what bincount does.
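For instance, on the seed-0 assignments (a quick check that the two give the same counts when all values occur):
import numpy as np
a = np.array([0, 1, 0, 1, 1])
print(np.unique(a, return_counts=True)[1])  # [2 3]
print(np.bincount(a))                       # [2 3] -- same counts, no values needed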
But it is not just a less capable version of unique (I mean, a version of unique that can't provide the list of values). bincount also accepts an optional argument weights, and in that case, what it returns is a weighted count.
For example, np.bincount([0,1,2,0,1], [10,5,2,1,1]) returns an array of 3 values (since there are 3 bins, 3 possible values): [10+1, 5+1, 2] = [11, 6, 2].
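A quick runnable check of that weighted count:
import numpy as np
print(np.bincount([0, 1, 2, 0, 1], weights=[10, 5, 2, 1, 1]))
# [11.  6.  2.] -- for each bin, the sum of the weights of its occurrences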
Note another difference between bincount and unique: it counts all possible values between 0 and the max value, including those that never occur. np.unique([3,3,5], return_counts=True)[1] returns [2, 1], while np.bincount([3,3,5]) returns array([0, 0, 0, 2, 0, 1]).
So, that could have also solved our previous problem (but, again, I needed v anyway).
So, now, if we compute
np.bincount(assignments, weights=points[:,0])
we get
array([0.86204682, 1.37276891])
which is one sum per value that actually occurs. Since 2 never occurs with the current seed, 0, there are only 2 values (if it had been 0 or 1 that never occurred, like with seed 13, we would have got 3 values, one of them being 0). It would be better to always have K values. We can force K values using the optional argument minlength:
np.bincount(assignments, weights=points[:,0], minlength=K)
So those are, for each value k in range(0, K), the sum of points[assignments==k, 0], that is, the sum of the x coordinates of the points in bin k.
To get what we want, we need to divide by the number of points in each bin:
np.bincount(assignments, weights=points[:,0], minlength=K) / np.bincount(assignments, minlength=K)
We get
array([0.43102341, 0.45758964, nan])
Those are the x values of the expected centers.
Likewise,
np.bincount(assignments, weights=points[:,1], minlength=K) / np.bincount(assignments, minlength=K)
returns
array([0.55485167, 0.33427908, nan])
which are the y values of the centers we expect.
Well, almost there. There is this nan.
It could be tempting to say "but this is because you added this minlength=K". Indeed, without it, we would have just computed this for the two values that occur. But that would have solved only the case where it is value K-1 that never occurs. For seed 13, for example, we would still have a problem.
One solution is simply to ensure that the divisor is never 0 (after all, in those empty-bin cases, we don't really care, since the numerator is 0 also).
Hence my final answer:
N=5
K=3
np.random.seed(0)
assignments = np.random.randint(0,K,(N,))
points = np.random.rand(N,2)
centers = np.zeros((K,2))
bc=np.maximum(1, np.bincount(assignments, minlength=K))
centers[:,0] = np.bincount(assignments, minlength=K, weights=points[:,0])/bc
centers[:,1] = np.bincount(assignments, minlength=K, weights=points[:,1])/bc
Giving the same result as before.
Note that there is still a for loop (here unrolled: I repeated the line twice).
It doesn't iterate over N points, nor over K possible values, but over the (here) 2 dimensions.
But the fact that, in your own code, you replaced 2 by points.shape[1] means that the dimension might not be 2. So you might need an actual for loop:
for dim in range(points.shape[1]):
    centers[:,dim] = np.bincount(assignments, minlength=K, weights=points[:,dim]) / bc
But, well, I suppose we can expect that the dimension of your points is well under both N and K.
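Packaged as a self-contained function (a sketch; the name recompute_centers is mine, not from your code):
import numpy as np

def recompute_centers(points, assignments, K):
    # points: (N, D) array; assignments: (N,) ints in [0, K)
    centers = np.empty((K, points.shape[1]))
    # Clamp counts to 1 so empty bins compute 0/1 instead of 0/0.
    bc = np.maximum(1, np.bincount(assignments, minlength=K))
    for dim in range(points.shape[1]):
        centers[:,dim] = np.bincount(assignments, minlength=K,
                                     weights=points[:,dim]) / bc
    return centers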
For N=1000000 points, K=100 bins, and Dim=3 dimensions (that is, the 1000000 points are triplets (x,y,z) of coordinates in a 3D space), timings are as follows (on my rather slow PC):
| Method | Timing |
|---|---|
| Initial (iterations over N and Dim) | 4151 ms |
| Iterations over K | 693 ms |
| Last solution (iterations over just Dim) | 29 ms |
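For reference, a sketch of how such a timing can be measured (exact numbers will of course vary by machine):
import numpy as np
from timeit import timeit

np.random.seed(0)
N, K, D = 1000000, 100, 3
points = np.random.rand(N, D)
assignments = np.random.randint(0, K, (N,))

def bincount_version():
    centers = np.empty((K, D))
    bc = np.maximum(1, np.bincount(assignments, minlength=K))
    for dim in range(D):
        centers[:,dim] = np.bincount(assignments, minlength=K,
                                     weights=points[:,dim]) / bc
    return centers

print(timeit(bincount_version, number=10) / 10, "s per call")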