
How to tune / choose the preference parameter of AffinityPropagation?


I have a large dictionary of pairwise similarity matrices that look like the following:

similarities['group1']:

array([[1.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 1.        , 0.09      , 0.09      , 0.        ],
       [0.        , 0.09      , 1.        , 0.94535157, 0.        ],
       [0.        , 0.09      , 0.94535157, 1.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 1.        ]])

In short, every element of the matrix above is the probability that record_i and record_j are similar (values between 0 and 1 inclusive): 1 means exactly similar and 0 means completely different.

I then feed each similarity matrix into an AffinityPropagation algorithm in order to group / cluster similar records:

sim = similarities['group1']

clusterer = AffinityPropagation(affinity='precomputed', 
                                damping=0.5, 
                                max_iter=25000, 
                                convergence_iter=2500, 
                                preference=????)  # ISSUE here

affinity = clusterer.fit(sim)

cluster_centers_indices = affinity.cluster_centers_indices_
labels = affinity.labels_

However, since I run the above on multiple similarity matrices, I need a generalised preference parameter, which I can't seem to tune.

The docs say it defaults to the median of the similarity matrix; however, with this setup I get lots of false positives. The mean sometimes works, sometimes gives too many clusters, etc.
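For reference, the default can be reproduced directly with NumPy. Computing the median and the mean of the `group1` example matrix from above shows why the default can behave poorly when most off-diagonal entries are zero:

```python
import numpy as np

# The 'group1' similarity matrix from above
sim = np.array([[1.        , 0.        , 0.        , 0.        , 0.        ],
                [0.        , 1.        , 0.09      , 0.09      , 0.        ],
                [0.        , 0.09      , 1.        , 0.94535157, 0.        ],
                [0.        , 0.09      , 0.94535157, 1.        , 0.        ],
                [0.        , 0.        , 0.        , 0.        , 1.        ]])

# scikit-learn's default preference is the median of the input similarities
print(np.median(sim))  # 0.0 -- most entries are zero, so the median collapses
print(np.mean(sim))    # ~0.29 -- dominated by the diagonal of ones
```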


e.g. when playing with the preference parameter, these are the results I get from the similarity matrix:

  • preference = default, i.e. the median of the similarity matrix (value 0.2): incorrect results; record 18 shouldn't be in the cluster because its similarity to the other records is very low:

     # Indexes of the elements in Cluster n°5: [15, 18, 22, 27]
    
     {'15_18': 0.08,
     '15_22': 0.964546229533378,
     '15_27': 0.6909703138051403,
     '18_22': 0.12,    # Not Ok, the similarity is too low
     '18_27': 0.19,    # Not Ok, the similarity is too low
     '22_27': 0.6909703138051403}
    
  • preference = 0.2 (in fact any value from 0.11 to 0.26): correct results, as the records are similar:

     # Indexes of the elements in Cluster n°5: [15, 22, 27]
    
     {'15_22': 0.964546229533378,
     '15_27': 0.6909703138051403,
     '22_27': 0.6909703138051403}
    

My question is: How should I choose this preference parameter in a way that would generalise?


Solution

  • A naive, brute-force grid-search solution can be implemented: if any connection inside a cluster is scored below a certain threshold (0.5, for example), we re-run the clustering with an adjusted value of the preference parameter.

    A naive implementation would look like the following.


    First, a function to test whether a clustering needs tuning, the threshold being 0.5 in this example:

    def is_tuning_required(similarity_matrix, rows_of_cluster):
        # A cluster needs tuning if any pairwise score inside it falls
        # at or below the 0.5 threshold
        rows = similarity_matrix[rows_of_cluster]

        for row in rows:
            for col_index in rows_of_cluster:
                if row[col_index] <= 0.5:
                    return True

        return False
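    For what it's worth, the same check can be written as a single vectorized expression with NumPy's `np.ix_` (a sketch, not part of the original implementation; the function name is mine):

```python
import numpy as np

def is_tuning_required_vectorized(similarity_matrix, rows_of_cluster):
    # Slice out the cluster's sub-matrix and check all pairwise scores at once
    sub = similarity_matrix[np.ix_(rows_of_cluster, rows_of_cluster)]
    return not np.all(sub > 0.5)

# Example with the 'group1' matrix from the question
sim = np.array([[1.  , 0.  , 0.        , 0.        , 0.  ],
                [0.  , 1.  , 0.09      , 0.09      , 0.  ],
                [0.  , 0.09, 1.        , 0.94535157, 0.  ],
                [0.  , 0.09, 0.94535157, 1.        , 0.  ],
                [0.  , 0.  , 0.        , 0.        , 1.  ]])

print(is_tuning_required_vectorized(sim, [2, 3]))     # False: pair is highly similar
print(is_tuning_required_vectorized(sim, [1, 2, 3]))  # True: the 0.09 links are too weak
```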
    

    Build a preference range of values against which the clustering would run:

    def get_pref_range(similarity):
        starting_point = np.median(similarity)

        if starting_point == 0:
            starting_point = np.mean(similarity)

        # Use a larger multiplicative step when the starting point is small,
        # to accelerate the pace at which candidate values are picked
        step = 2 if starting_point >= 0.05 else 4

        # Walk upwards towards 1, switching to finer 1.25x steps above 0.1
        preference_tuning_range = [starting_point]
        max_val = starting_point
        while max_val < 1:
            max_val *= 1.25 if max_val > 0.1 and step == 2 else step
            preference_tuning_range.append(max_val)

        # Walk downwards towards 0.01
        min_val = starting_point
        if starting_point >= 0.05:
            while min_val > 0.01:
                min_val /= step
                preference_tuning_range.append(min_val)

        return preference_tuning_range
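    As an alternative, if the hand-rolled range feels fragile: NumPy's `geomspace` generates a comparable multiplicatively spaced candidate range in one line (a sketch; the bounds 0.01 and 1.0 mirror the limits used above):

```python
import numpy as np

# Log-spaced preference candidates between the same bounds the loops above use
candidates = np.geomspace(0.01, 1.0, num=10)
print(candidates)  # 10 values from 0.01 up to 1.0, multiplicatively spaced
```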
    

    A normal AffinityPropagation run, with a preference parameter passed:

    def run_clustering(similarity, preference):
        clusterer = AffinityPropagation(damping=0.9, 
                                        affinity='precomputed', 
                                        max_iter=5000, 
                                        convergence_iter=2500, 
                                        verbose=False, 
                                        preference=preference)
    
        affinity = clusterer.fit(similarity)
    
        labels = affinity.labels_
    
        return labels, len(set(labels)), affinity.cluster_centers_indices_
    

    The method we would actually call with a similarity (1 - distance) matrix as an argument:

    def run_ideal_clustering(similarity):
        preference_tuning_range = get_pref_range(similarity)
    
        best_tested_preference = None
        for preference in preference_tuning_range:
            labels, labels_count, cluster_centers_indices = run_clustering(similarity, preference)
    
            needs_tuning = False
            wrong_clusters = 0
            for label_index in range(labels_count):
                cluster_elements_indexes = np.where(labels == label_index)[0]

                if is_tuning_required(similarity, cluster_elements_indexes):
                    wrong_clusters += 1
                    needs_tuning = True
    
            if best_tested_preference is None or wrong_clusters < best_tested_preference[1]:
                best_tested_preference = (preference, wrong_clusters)
    
            if not needs_tuning:
                return labels, labels_count, cluster_centers_indices
    
        # No preference in the range produced a fully valid clustering;
        # fall back to the least wrong preference found
        return run_clustering(similarity, best_tested_preference[0])
    

    Obviously, this is a brute-force solution which will not perform well on large datasets / similarity matrices.

    If a simpler and better solution gets posted, I'll accept it.