python cluster-analysis k-means unsupervised-learning

TypeError: len() of unsized object in pyclustering library


I am using the pyclustering library to perform K-means clustering. The datasets are read from CSV files, as shown in the code below. I have tried passing X_scaled both as a NumPy array and as a list (via tolist()), but I always get this error:

TypeError: len() of unsized object

Version of pyclustering: 0.10.1.2

The code is below:

from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils.metric import distance_metric, type_metric
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler

# Define a function to convert distance metric names to functions
def get_distance_metric(metric_name):
    if metric_name == 'euclidean':
        return distance_metric(type_metric.EUCLIDEAN)
    elif metric_name == 'squared euclidean':
        return distance_metric(type_metric.EUCLIDEAN_SQUARE)
    elif metric_name == 'manhattan':
        return distance_metric(type_metric.MANHATTAN)
    elif metric_name == 'chebyshev':
        return distance_metric(type_metric.CHEBYSHEV)
    elif metric_name == 'canberra':
        return distance_metric(type_metric.CANBERRA)
    elif metric_name == 'chi-square':
        return distance_metric(type_metric.CHI_SQUARE)
    else:
        raise ValueError(f"Unsupported distance metric: {metric_name}")

# Define the distance measures dictionary
distance_measures = {'euclidean': 0, 'squared euclidean': 1, 'manhattan': 2, 'chebyshev': 3, 
                    'canberra': 5, 'chi-square': 6}

# Example of running the modified code
datasets = main_datasets
df = datasets['circles0.3.csv']

original_labels = df['label'].values if 'label' in df.columns else None
X = df.drop(columns=['label'], errors='ignore').values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
    
# Set the number of clusters
k = 3

# Experiment with various distance metrics
for metric_name, metric_code in distance_measures.items():
    # Get the distance metric function
    distance_metric_func = get_distance_metric(metric_name)
    
    # Perform K-means clustering with the selected distance metric
    
    # centers, clusters = kmeans(X_scaled.tolist(), k, metric=distance_metric_func)
    centers, clusters = kmeans(X_scaled, k, metric=distance_metric_func)
    
    # Plot the clusters
    plt.figure()
    plt.title(f'K-means Clustering with {metric_name}')
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.scatter([point[0] for point in X_scaled], [point[1] for point in X_scaled], c=clusters, cmap='viridis')
    plt.scatter([center[0] for center in centers], [center[1] for center in centers], marker='x', c='red', s=100)
    plt.show()

Can anybody help me out with what the issue might be with this code?


Solution

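  • In the following function call, pass an initial_centers argument (a list of starting centers) instead of k. kmeans expects that list as its second argument, and passing the integer k there is what appears to trigger the len() error. Additionally, convert the X_scaled array to a list before passing it to the kmeans function.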

    centers, clusters = kmeans(X_scaled, k, metric=distance_metric_func)
    

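    Use the code below instead. kmeans is a class, so call process() on the instance and then read the results with get_clusters() and get_centers():

    from pyclustering.cluster.center_initializer import random_center_initializer
    X_scaled_list = X_scaled.tolist()
    # Pick k random initial centers (the question sets k = 3)
    initial_centers = random_center_initializer(X_scaled_list, k).initialize()
    kmeans_instance = kmeans(X_scaled_list, initial_centers, metric=distance_metric_func)
    kmeans_instance.process()
    clusters = kmeans_instance.get_clusters()
    centers = kmeans_instance.get_centers()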
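One more point for the plotting step in the question: get_clusters() returns a list of clusters, each holding the indexes of its points, rather than one label per point. The scatter call with c=clusters therefore needs a flat list of labels. Below is a minimal sketch of that conversion. The helper name clusters_to_labels and the example clusters are made up for illustration and are not part of pyclustering.

```python
def clusters_to_labels(clusters, n_points):
    """Turn a list of index lists into one integer label per point."""
    labels = [-1] * n_points  # -1 marks points not assigned to any cluster
    for cluster_id, member_indices in enumerate(clusters):
        for index in member_indices:
            labels[index] = cluster_id
    return labels


# Example with made-up clusters over six points (not real output)
example_clusters = [[0, 3], [1, 4, 5], [2]]
labels = clusters_to_labels(example_clusters, 6)
print(labels)  # [0, 1, 2, 0, 1, 1]
```

With the real data, something like labels = clusters_to_labels(clusters, len(X_scaled_list)) could then be passed as c=labels to plt.scatter.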