Tags: python, machine-learning, scikit-learn, cluster-analysis, hierarchical-clustering

In scikit-learn's agglomerative clustering algorithm, how would you get all the intermediate clusters?


I am running the relatively straightforward loop shown below.

If I understand the algorithm correctly, clustering down to, say, 8 clusters means you should already have the results for every cluster count above 8, right?

Do you actually have to run the code once per cluster count, or is there a way to retrieve the intermediate clusterings from a single fit?

%%time
import time

import numpy as np
from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering

# This snippet assumes K, w, cont_std, cont_std_, s and db were defined
# earlier in the notebook (s and db as dicts).
for k in K:
    start_time = time.time()  # Start timing

    s[k] = []
    db[k] = []

    np.random.seed(123456)  # for reproducibility
    model = AgglomerativeClustering(linkage='ward', connectivity=w.sparse, n_clusters=k)
    model.fit(cont_std)  # fit() returns the model itself; the labels are in model.labels_
    cont_std_[f'AHC_k{k}'] = model.labels_

    silhouette_score = metrics.silhouette_score(cont_std, model.labels_, metric='euclidean')
    print(f'silhouette at k={k}: {silhouette_score}')
    s[k].append(silhouette_score)

    davies_bouldin_score = metrics.davies_bouldin_score(cont_std, model.labels_)
    print(f'davies bouldin at k={k}: {davies_bouldin_score}')
    db[k].append(davies_bouldin_score)

    end_time = time.time()  # End timing
    print(f"Time for k={k}: {end_time - start_time} seconds")  # Print the duration for the cycle

Solution

  • This is probably a rather roundabout way to get there, but it appears to work. I may yet try to clean this up later.

    import numpy as np

    # X is the data the model was fitted on (cont_std in the question), and
    # model is a fitted AgglomerativeClustering whose children_ holds the
    # full merge tree (compute_full_tree=True guarantees this).
    n_samples = len(X)
    n_nodes = n_samples + len(model.children_)

    # Generate the list of nodes throughout the process,
    # and an array that for each node index indicates the iteration
    # at which it got merged with another. The root is never merged,
    # so it keeps the sentinel len(model.children_).
    nodes = [[i] for i in range(n_samples)]
    merged_at_stage = np.full(n_nodes, len(model.children_), dtype=int)
    for i, (a, b) in enumerate(model.children_):
        nodes.append(nodes[a] + nodes[b])
        merged_at_stage[a] = i
        merged_at_stage[b] = i

    # For a fixed number of clusters, identify the nodes
    # at that point in the process
    N_CLUSTERS = 2
    clusters = [
        nodes[i]
        for i, x in enumerate(merged_at_stage)
        if (
            x >= n_samples - N_CLUSTERS  # the node hasn't already been merged with another
            and i <= n_nodes - N_CLUSTERS  # the node has already been created
        )
    ]
    

    clusters is then a list of lists of sample indices. To turn that into a Series of cluster labels (e.g. for scoring):

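    import pandas as pd

    # -1 marks samples not yet assigned; none should remain after the loop
    y_pred = pd.Series([-1] * len(X))
    for i, cluster in enumerate(clusters):
        y_pred.iloc[cluster] = i  # positional assignment of label i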
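
The node bookkeeping and the label conversion above can also be folded into one helper. What follows is a minimal sketch rather than part of the original answer: `labels_at_k` is a hypothetical name, and it assumes `children` follows the layout scikit-learn documents for `children_` (row i merges two nodes into a new node numbered n_samples + i) and contains every merge, which `compute_full_tree=True` should ensure.

```python
def labels_at_k(children, n_samples, n_clusters):
    """Replay the first (n_samples - n_clusters) merges; return one label per sample.

    Hypothetical helper: `children` is assumed to be laid out like
    AgglomerativeClustering.children_ (row i creates node n_samples + i).
    """
    if not 1 <= n_clusters <= n_samples:
        raise ValueError("n_clusters must be between 1 and n_samples")
    n_merges = n_samples - n_clusters
    if len(children) < n_merges:
        raise ValueError("children does not contain enough merges (tree not full?)")

    # members[node] lists the samples under that node; leaves are 0..n_samples-1.
    members = {i: [i] for i in range(n_samples)}
    for step, (a, b) in enumerate(children[:n_merges]):
        # The two merged nodes disappear and are replaced by the new node.
        members[n_samples + step] = members.pop(int(a)) + members.pop(int(b))

    # Number the surviving nodes 0..n_clusters-1 in order of node id.
    labels = [-1] * n_samples
    for label, node in enumerate(sorted(members)):
        for sample in members[node]:
            labels[sample] = label
    return labels


# Toy tree on 4 samples: node 4 = {0, 1}, node 5 = {2, 3}, node 6 = everything.
toy_children = [(0, 1), (2, 3), (4, 5)]
print(labels_at_k(toy_children, 4, 2))  # [0, 0, 1, 1]
```

With a fitted model, the call would be `labels_at_k(model.children_, len(X), k)`. The label numbers may differ from those in `model.labels_`, but the grouping of samples should be the same.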
    

    Colab notebook using the Iris dataset
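
Coming back to the loop in the question: because each merge only joins two existing clusters, a single fit with the full tree should be enough to score every k. The sketch below replays the merges once and takes a snapshot of the labels at each requested k. It is an illustration under assumptions: `iter_labelings` is a hypothetical helper, the data is a synthetic stand-in for `cont_std`, and the connectivity constraint from the question is left out.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs


def iter_labelings(children, n_samples, ks):
    """Replay the merges once, yielding (k, labels) for each requested k.

    Hypothetical helper; k runs downward from n_samples - 1 to 1, so the
    trivial k == n_samples case (no merges) is never yielded.
    """
    wanted = set(ks)
    top = np.arange(n_samples)  # id of the current top-level node of each sample
    for step, (a, b) in enumerate(children):
        # Row `step` merges nodes a and b into the new node n_samples + step.
        top[(top == a) | (top == b)] = n_samples + step
        k = n_samples - step - 1  # clusters left after this merge
        if k in wanted:
            # Renumber the surviving node ids to 0..k-1.
            yield k, np.unique(top, return_inverse=True)[1]


# Synthetic stand-in for the question's cont_std (an assumption of this sketch).
X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# One fit only; n_clusters here just determines model.labels_.
model = AgglomerativeClustering(
    n_clusters=2, linkage="ward", compute_full_tree=True
).fit(X)

labelings, s, db = {}, {}, {}
for k, labels in iter_labelings(model.children_, len(X), ks=range(2, 7)):
    labelings[k] = labels
    s[k] = metrics.silhouette_score(X, labels, metric="euclidean")
    db[k] = metrics.davies_bouldin_score(X, labels)
    print(f"k={k}: silhouette={s[k]:.3f}, davies bouldin={db[k]:.3f}")
```

The replay costs O(n_samples) per merge, which is simple but not optimal. For large data, the list-of-members approach from the answer avoids rescanning all samples at each step. Passing the question's `connectivity=w.sparse` to the single fit should work the same way, though that is not exercised here.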