python scikit-learn cluster-analysis hierarchical-clustering

AgglomerativeClustering on a correlation matrix

I have a correlation matrix of size 288x288, defined by:

from sklearn.cluster import AgglomerativeClustering
df = read_returns()
correl_matrix = df.corr()

where read_returns gives me a DataFrame with a date index and one column of returns per asset.

Now - I want to cluster these correlations to reduce the population size.

By doing some reading and experimenting I discovered AgglomerativeClustering - and it appears at first pass to be an appropriate solution to my problem.

I define a distance metric as ((.5*(1-correl_matrix))**.5) and have:

cluster = AgglomerativeClustering(n_clusters=40, linkage='average')
cluster.fit(((.5*(1-correl_matrix))**.5).values)
label_groups = cluster.labels_

To cross-check my work, I pick out cluster 1 and look for the minimum pairwise correlation between any two items in that group:

# Collect the names of all assets assigned to cluster 1
single_cluster = []
for i in range(correl_matrix.shape[0]):
    if label_groups[i] == 1:
        single_cluster.append(correl_matrix.index[i])

# Find the smallest correlation between two distinct members of the cluster
min_correl = 1.0
for x in single_cluster:
    for y in single_cluster:
        if x != y:
            if correl_matrix[x][y] < min_correl:
                min_correl = correl_matrix[x][y]

print(min_correl)

and get a minimum pairwise correlation of 0.20.

To me this seems quite low - but "low based off what?" is a fair question to which I have no answer.

I would like to enforce that every pairwise correlation within a cluster is >= 0.7, or something like this.

Is this possible in AgglomerativeClustering?

Am I accidentally going down the wrong path?

Solution

Hierarchical clustering supports different "linkage" strategies.

single-link: the distance between two clusters is the minimum distance between any pair of their points
complete-link: the distance between two clusters is the maximum distance between any pair of their points
...

If you want a high minimum correlation within each cluster, which is the same as a small maximum distance, this calls for complete linkage.

You may want to treat negative correlations as "good", too, i.e. use dist = 1 - abs(corr).
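To illustrate the difference between the two distances, here is a small sketch on made-up data: a series and its exact negative. Under the question's distance they are as far apart as possible, while under 1 - abs(corr) they are treated as identical.

```python
import numpy as np

# Hypothetical example data: x and its mirror image are perfectly anti-correlated.
rng = np.random.default_rng(0)
x = rng.normal(size=250)
corr = np.corrcoef(x, -x)[0, 1]

# Distance from the question: strongly negative correlation means "far apart".
signed_dist = np.sqrt(0.5 * max(1.0 - corr, 0.0))

# Distance suggested above: strongly negative correlation means "close".
abs_dist = 1.0 - abs(corr)

print(signed_dist, abs_dist)
```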

Make sure to use the dendrogram. If you have outliers in your data, you want to cut into (n_clusters+n_outliers) partitions.
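A rough sketch of the dendrogram and the (n_clusters+n_outliers) cut, using SciPy's hierarchy module rather than AgglomerativeClustering (a swapped-in tool, since it exposes the tree directly). The data is synthetic: two groups of four assets plus two pure-noise series named outlier_a and outlier_b, all hypothetical. no_plot=True only returns the leaf order; dropping it draws the tree if matplotlib is installed.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import squareform

# Synthetic data: 2 groups of 4 correlated assets plus 2 unrelated "outliers".
rng = np.random.default_rng(1)
factors = rng.normal(size=(500, 2))
data = {f"asset_{i}": factors[:, i % 2] + 0.4 * rng.normal(size=500) for i in range(8)}
data["outlier_a"] = rng.normal(size=500)
data["outlier_b"] = rng.normal(size=500)
corr = pd.DataFrame(data).corr()

# Same distance as in the question, converted to SciPy's condensed form.
dist = np.sqrt(0.5 * np.clip(1.0 - corr.values, 0.0, None))
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="complete")

# Leaf order of the dendrogram, computed without plotting.
tree = dendrogram(Z, no_plot=True, labels=list(corr.index))
print(tree["ivl"])

# Cut into n_clusters + n_outliers partitions.
n_clusters, n_outliers = 2, 2
labels = fcluster(Z, t=n_clusters + n_outliers, criterion="maxclust")
sizes = pd.Series(labels, index=corr.index).value_counts()
print(sizes)
```

In this setup the outliers should come out as singleton clusters, leaving the two real groups intact; cutting into only n_clusters partitions would force them into one of the groups.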