Tags: python, numpy, vectorization

Remove for-loop challenge and existence of heuristics to remove for-loops in general


I don't know if I am taking it too far, but I would like to remove the for loop in the below code:

# self.assignments = (N, )
# self.points = (N, D)
# centers = (K, D)
centers = np.zeros((self.K, self.points.shape[1]))
_, counts = np.unique(self.assignments, return_counts=True)
for point, assignment in zip(self.points, self.assignments):
    # The two lines below hard-code D = 2, which is wrong in general;
    # they also amount to an unrolled inner loop over D.
    centers[assignment][0] += point[0] / counts[assignment]
    centers[assignment][1] += point[1] / counts[assignment]
return centers

This code recalculates the centers of a K-Means clustering algorithm based on the assignments. I am struggling to remove the for loop, but my intuition says it can be removed. This got me wondering: are there any rules/heuristics one can keep in mind when attempting to remove for-loops through techniques like vectorization?

I am looking for an answer to whether the for-loop in the above code can be removed, and for some general rules or heuristics one can apply when optimizing code by removing for-loops.

Sample Input & Output:

points = np.array([[-1.431,  0.878],
                   [-0.920,  0.488]])
assignments = np.array([0, 1])
centers = np.array([[-0.299,  0.224]])
output = [[-1.431  0.878]
          [-0.920  0.488]]

Solution

  • Minimal reproducible example (mre)

    So, I take it that assignments is an array of N integers in the interval [0, K), assigning each point to a bin, and that centers is the mean of the points in each bin. Right.

    Here is a minimal reproducible example of the problem as I understand it

    import numpy as np

    N=5
    K=3
    np.random.seed(0)
    assignments = np.random.randint(0,K,(N,))
    points = np.random.rand(N,2)
    centers = np.zeros((K,2))
    _, counts = np.unique(assignments, return_counts=True)
    for point, assignment in zip(points, assignments):
        centers[assignment][0] += point[0] / counts[assignment]
        centers[assignment][1] += point[1] / counts[assignment]
    

    Result being

    array([[0.43102341, 0.55485167],
           [0.45758964, 0.33427908],
           [0.        , 0.        ]])
    

    Which seems consistent, since points and assignments are

    #points
    array([[0.38438171, 0.29753461],
           [0.05671298, 0.27265629],
           [0.47766512, 0.81216873],
           [0.47997717, 0.3927848 ],
           [0.83607876, 0.33739616]])
    #assignments
    array([0, 1, 0, 1, 1])
    

    So, the first center is the sum of points 0 and 2, divided by 2, and the second is the sum of points 1, 3 and 4, divided by 3.

    error if some assignment value never occurs

    Before we get to vectorization, first note that we were lucky that it even worked like this. Just change the seed to 13, and instead of a result we get an error

    IndexError: index 2 is out of bounds for axis 0 with size 2
    

    The reason is easy to understand looking at assignments and counts

    # assignments
    array([2, 0, 2, 0, 2])
    # counts
    array([2, 3])
    

    So, your code relies on every possible value occurring at least once.

    Since we don't know where assignments comes from, maybe that is guaranteed. But if it is not, your code risks failing whenever some assignment value never occurs.

    So, my first suggestion would be to change your way to create counts

    counts=np.zeros((K,), dtype=int)
    v,c=np.unique(assignments, return_counts=True)
    counts[v]=c
    

    Which, in addition to correcting an error in your code, gives you a first hint of how vectorization can be applied to array indexing: v here is an array of ints used as an index.
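
    To make that hint concrete (a toy example of my own, not from the original): indexing with an integer array reads or writes several positions in one vectorized operation, which is exactly what counts[v]=c does above.

    a = np.zeros(5, dtype=int)
    idx = np.array([1, 3])   # an array of ints, playing the role of v
    a[idx] = [10, 20]        # writes a[1] and a[3] in one step
    # a is now array([ 0, 10,  0, 20,  0])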

    So adding the improvement suggested by hpaulj (or a variant of it), our code is now

    N=5
    K=3
    np.random.seed(0)
    assignments = np.random.randint(0,K,(N,))
    points = np.random.rand(N,2)
    centers = np.zeros((K,2))
    counts=np.zeros((K,), dtype=int)
    v,c=np.unique(assignments, return_counts=True)
    counts[v]=c
    for point, assignment in zip(points, assignments):
        centers[assignment] += point / counts[assignment]
    

    Giving, of course, the exact same result.

    It would be tempting now to continue that vectorization by doing

    centers[assignments] += points / counts[assignments, None]
    

    But this doesn't work. It gives

    array([[0.23883256, 0.40608436],
           [0.27869292, 0.11246539],
           [0.        , 0.        ]])
    

    Which, as we can easily check, is, for each bin, keeping only the last point (0.2388 is 0.47766512/2, and 0.27869292 is 0.83607876/3).

    That is because it first computes centers[assignments] + points / counts[assignments, None], and then assigns those values to centers[assignments]. So exactly as centers[[0,1,0]]=[1,2,3] means that centers[0] is now 3 and centers[1] is 2 (the value 1 that was briefly assigned to centers[0] has been overwritten), all but the last value written to each bin are overwritten.
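
    As an aside, not used in the rest of this answer: NumPy ships an unbuffered variant of += for exactly this situation, np.add.at, where repeated indices accumulate instead of overwriting each other. A minimal sketch, reusing the seed-0 variables from above:

    centers = np.zeros((K,2))
    # np.add.at applies += without buffering, so repeated indices
    # in `assignments` accumulate instead of overwriting each other
    np.add.at(centers, assignments, points / counts[assignments, None])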

    Iterating over K bins instead of N points

    One method, still using a for loop, but a smaller one (well, I take that K is supposed to be smaller than N) would be to loop over the bins

    for k in range(K):
        mask = assignments==k
        centers[k] = points[mask].sum(axis=0) / counts[k]
    

    But if one bin is empty, that leads to division by 0. Which is easily solved by iterating only over the values that actually occur (we have already computed this array, called v, earlier)

    for k in v:
        mask = assignments==k
        centers[k] = points[mask].sum(axis=0) / counts[k]
    

    Which still gives the same result as earlier, with fewer iterations (v has at most min(K, N) elements, and I expect it to be far smaller than N).

    bincount or iterating over 2 dimensions

    Since hpaulj mentioned bincount, we can indeed use it. But not in one line.

    bincount does essentially what you did with unique. I didn't mention it before because I needed the values (my v vector). But you didn't: you ignored the values output of unique and kept only the counts. In that case, you could have used bincount directly: counts = np.bincount(assignments). That is exactly what bincount does.

    But it is not just a less capable version of unique (that is, one that can't provide the list of values). bincount also accepts an optional argument weights, and in that case what it returns is a weighted count.

    For example,

    np.bincount([0,1,2,0,1], [10,5,2,1,1])
    

    returns an array of 3 values (since there are 3 bins, 3 possible values):
    [10+1, 5+1, 2]=[11,6,2]

    Note another difference between bincount and unique: bincount counts all possible values between 0 and the max value, including those that never occur.

    np.unique([3,3,5], return_counts=True)[1] returns array([2, 1]), while np.bincount([3,3,5]) returns array([0, 0, 0, 2, 0, 1]).

    So, that could have also solved our previous problem (but, again, I needed v anyway).
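
    For instance (my own check), with the seed-13 assignments shown earlier:

    np.bincount(np.array([2, 0, 2, 0, 2]))
    # array([2, 0, 3]): the missing bin 1 gets a 0 instead of vanishing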

    So, now, if we compute np.bincount(assignments, weights=points[:,0]) (back to the seed-0 data), we get

    array([0.86204682, 1.37276891])

    That is one sum per value that actually occurs. Since 2 never occurs with seed 0, there are only 2 values (if it had been 0 or 1 that never occurred, as with seed 13, we would have got 3 values, one of them being 0). It would be better to always get K values. We can force K values using the optional argument minlength

    np.bincount(assignments, weights=points[:,0], minlength=K)
    

    So those are, for each value k in range(0, K), the sum of points[i,0] over all points i assigned to bin k.

    To get what we want we need to divide by the number of points in each bin

    np.bincount(assignments, weights=points[:,0], minlength=K) / np.bincount(assignments, minlength=K)
    

    We get
    array([0.43102341, 0.45758964, nan])

    Those are the x values of the expected centers.

    Likewise

    np.bincount(assignments, weights=points[:,1], minlength=K) / np.bincount(assignments, minlength=K)
    

    Returns
    array([0.55485167, 0.33427908, nan])

    which are the y values of the centers we expect

    Well, almost there. There is this nan. It could be tempting to say "but that is only because you added minlength=K". Indeed, without it, we would have computed this only for the two bins that occur. But that only fixes the case where it is value K-1 that never occurs. For seed 13, for example, we would still have a problem.

    One solution is simply to ensure that the divisor is never 0 (after all, in those cases we don't really care, since the numerator is 0 too).

    Hence my final answer

    import numpy as np

    N=5
    K=3
    np.random.seed(0)
    assignments = np.random.randint(0,K,(N,))
    points = np.random.rand(N,2)
    centers = np.zeros((K,2))
    # clamp counts to at least 1, so empty bins yield 0/1 instead of 0/0
    bc=np.maximum(1, np.bincount(assignments, minlength=K))
    centers[:,0] = np.bincount(assignments, minlength=K, weights=points[:,0])/bc
    centers[:,1] = np.bincount(assignments, minlength=K, weights=points[:,1])/bc
    

    Giving the same result as before.

    Note that there is still a for loop (here unrolled: I repeated the line twice). It doesn't iterate over the N points, nor over the K possible values, but over the (here) 2 dimensions. But the fact that, in your own code, you wrote points.shape[1] rather than a literal 2 means that the dimension might not be 2. So you might need an actual for loop

    for dim in range(points.shape[1]):
        centers[:,dim] = np.bincount(assignments, minlength=K, weights=points[:,dim])/bc
    

    But, well, I suppose we can expect the dimension of your points to be well under both N and K.
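
    If even that small loop bothers you, a known trick (my addition, not part of the answer above) is to fold each (bin, dimension) pair into a single flat index, so that one bincount call covers all dimensions at once:

    D = points.shape[1]
    # pair (bin k, dimension d) maps to the flat index k*D + d
    flat_idx = (assignments[:, None] * D + np.arange(D)).ravel()
    sums = np.bincount(flat_idx, weights=points.ravel(), minlength=K * D)
    centers = sums.reshape(K, D) / bc[:, None]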

    Timings

    For N=1000000 points, K=100 bins, and Dim=3 dimensions (that is, the 1000000 points are triplets (x,y,z) of coordinates in 3D space), timings are as follows (on my rather slow PC):

    Method                                     Timing
    ----------------------------------------   -------
    Initial (iterations over N and Dim)        4151 ms
    Iterations over K                           693 ms
    Last solution (iterations over just Dim)     29 ms
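
    If you want to reproduce such a comparison, here is a minimal timing sketch with timeit (my own harness; exact numbers will of course differ per machine):

    import timeit
    import numpy as np

    N, K, D = 1_000_000, 100, 3
    rng = np.random.default_rng(0)
    assignments = rng.integers(0, K, N)
    points = rng.random((N, D))

    def bincount_version():
        # the final solution above: one bincount per dimension
        bc = np.maximum(1, np.bincount(assignments, minlength=K))
        centers = np.empty((K, D))
        for dim in range(D):
            centers[:, dim] = np.bincount(
                assignments, minlength=K, weights=points[:, dim]) / bc
        return centers

    print(timeit.timeit(bincount_version, number=10) / 10, "s per call")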