
Variable Scoping in Python


I'm currently writing a simple Python program for k-medians clustering, but I ran into a problem that I think is related to variable scoping.

Here is my clustering method:

import numpy as np


class Cluster(object):
    center = None
    points = []

    def __init__(self, center):
        super(Cluster, self).__init__()
        self.center = center


def manhattan(row_a, row_b):
    dimensions = len(row_a)
    manhattan_dist = 0

    for i in range(0, dimensions):
        manhattan_dist = manhattan_dist + np.abs(float(row_a[i]) - float(row_b[i]))

    return manhattan_dist

def cluster(dataset, cluster_centers):
    clusters = []
    for cluster_center in cluster_centers:
        clusters.append(Cluster(center = cluster_center))

    for point in dataset:
        last_dist = np.inf
        last_cluster = None

        for cluster in clusters:
            dist = manhattan(point, cluster.center)
            if(dist != 0):
                if (dist < last_dist):
                    print str(dist) + " " + str(last_dist)
                    last_dist = dist
                    last_cluster = cluster


        last_cluster.points.append(point)


    return clusters

result = cluster([[1,1], [1,2], [1,3], [7,2], [8,3], [7,1]], [[2,2], [6,6]])


And here is the output that I got:

[image: screenshot of the program output]

The problem is with assigning a value to the variable "last_dist" (and possibly "last_cluster") inside the for-loop over clusters. Judging by the printed output, the value doesn't seem to be updated at all, except for a single iteration where it is 7 before going back to its original value "Inf". What is the root cause of this, and what can I do about it? Thank you.


Solution

  • What else do you expect to happen? Here is your code:

    for point in dataset:
        last_dist = np.inf # this line is executed 6 times
        last_cluster = None 
    
        for cluster in clusters:
            ...
    

    You only have 2 items in clusters and 6 in dataset. For each point (6 times), last_dist is therefore reset to inf before the inner loop runs. You have 6 infs in your output, so that part is working as expected. The print only runs when your condition if (dist < last_dist) holds. For the first cluster, last_dist is still inf, so inf is what gets printed. For the second cluster, it looks like the condition holds exactly once, which is why you get 7.0 instead of inf. Perhaps you have a bug in manhattan()?
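    To see the reset for yourself, here is a minimal sketch that records every update of the running minimum instead of printing it. It is written for Python 3 and uses math.inf in place of np.inf so that it runs without NumPy. The helper name trace_updates is made up for this illustration, and manhattan is a compact rewrite of the function from the question, not the original.

```python
import math


def manhattan(row_a, row_b):
    # Sum of absolute coordinate differences (L1 distance).
    return sum(abs(float(a) - float(b)) for a, b in zip(row_a, row_b))


def trace_updates(dataset, centers):
    """Hypothetical helper: one (point, dist, previous_best) tuple per update."""
    updates = []
    for point in dataset:
        best = math.inf  # reset once per point, exactly as in the question
        for center in centers:
            dist = manhattan(point, center)
            if dist != 0 and dist < best:
                # This is the spot where the question's code prints.
                updates.append((point, dist, best))
                best = dist
    return updates


dataset = [[1, 1], [1, 2], [1, 3], [7, 2], [8, 3], [7, 1]]
centers = [[2, 2], [6, 6]]

for point, dist, previous in trace_updates(dataset, centers):
    print(point, dist, previous)
```

    With the data from the question this prints seven lines. Six of them show inf as the previous value (one per point, from the first center), and only the point [8, 3] produces a second update, where the previous value is 7.0.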

    Because