Tags: python, scikit-learn, k-means

Define k-1 cluster centroids -- SKlearn KMeans


I am performing a binary classification of a partially labeled dataset. I have a reliable estimate of its 1's, but not of its 0's.

From sklearn KMeans documentation:

init : {‘k-means++’, ‘random’ or an ndarray}
Method for initialization, defaults to ‘k-means++’:   
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

I would like to pass an ndarray, but I only have 1 reliable centroid, not 2.
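
For reference, this is what passing a complete ndarray of initial centers looks like when both are known (a minimal sketch with made-up center values; X stands in for the feature matrix):

    from sklearn.cluster import KMeans
    import numpy as np

    # Hypothetical initial centers for a 2-feature problem
    init_centers = np.array([[0.1, 0.2],   # the reliable "1" centroid
                             [0.9, 0.8]])  # the centroid I don't actually have

    # n_init=1 since the initial centers are given explicitly
    km = KMeans(n_clusters=2, init=init_centers, n_init=1)
    km.fit(X)  # X is the (n_samples, 2) feature matrix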

Is there a way to maximize the entropy between the first K-1 centroids and the Kth? Alternatively, is there a way to manually initialize K-1 centroids and use k-means++ for the remaining one?

=======================================================

Related questions:

This seeks to define K centroids with n-1 features. (I want to define k-1 centroids with n features).

Here is a description of what I want, but it was interpreted as a bug report by one of the developers, who said it would be "easily implement[able]".


Solution

  • I'm reasonably confident this works as intended, but please correct me if you spot an error (cobbled together from GeeksforGeeks):

    
    import sys
    import numpy as np
    import pandas as pd

    def distance(p1, p2):
        # Squared Euclidean distance between two points
        return np.sum((p1 - p2)**2)


    def find_remaining_centroid(data, known_centroids, k=1):
        '''
        Initializes the remaining centroids for k-means++.
        Inputs:
            data - numpy array containing the feature space
            known_centroids - numpy array containing the location of one or more known centroids
            k - number of remaining centroids to be found
        '''
        # Perform casting if necessary
        if isinstance(data, pd.DataFrame):
            data = np.array(data)

        n_points = data.shape[0]

        # Initialize the centroid list from the known centroid(s)
        if known_centroids.ndim > 1:
            centroids = [cent for cent in known_centroids]
        else:
            centroids = [np.array(known_centroids)]

        # Compute the k remaining centroids
        for c_id in range(k):
            ## distance of each data point from its nearest
            ## previously selected centroid
            dist = np.empty(n_points)

            for i in range(n_points):
                point = data[i, :]
                d = sys.maxsize

                ## compute distance of 'point' from each of the previously
                ## selected centroids and keep the minimum distance
                for j in range(len(centroids)):
                    temp_dist = distance(point, centroids[j])
                    d = min(d, temp_dist)

                dist[i] = d

            ## select the data point with maximum distance as the next centroid
            next_centroid = data[np.argmax(dist), :]
            centroids.append(next_centroid)

        return centroids[-k:]
    

    Its usage:

    # For finding a third centroid:
    third_centroid = find_remaining_centroid(X_train, np.array([presence_seed, absence_seed]), k = 1)
    
    # For finding the second centroid:
    second_centroid = find_remaining_centroid(X_train, presence_seed, k = 1)
    

    Where presence_seed and absence_seed are known centroid locations.
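
    The remaining centroid(s) come back as a list, so they can be stacked with the known seed and passed straight to KMeans through init (a sketch following the variable names above; the rest of the pipeline is assumed):

    import numpy as np
    from sklearn.cluster import KMeans

    # find_remaining_centroid returns a list of arrays, so concatenate
    # the known seed with the found centroid(s) into one (k, n_features) array
    init_centers = np.vstack([presence_seed] + second_centroid)

    # n_init=1 because the initial centers are supplied explicitly
    km = KMeans(n_clusters=2, init=init_centers, n_init=1)
    labels = km.fit_predict(X_train)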