python, cluster-analysis, hierarchical-clustering

How to do data correlation clustering plot in python


I've got a database that contains information about commits made to a repo. For example:

commit-sha1 | file1 | 
commit-sha1 | file2 |
commit-sha2 | file2 |
commit-sha2 | file3 | 

and so on. Basically, this shows that sha1 changed files (file1, file2) and sha2 changed files (file2, file3). Now I want to see whether some files are correlated, i.e. what the chances are that file1 and file2 are committed together, etc. For this, I first found the top 50 most commonly committed files, which gave me

file1 - 1500
file2 - 1423
file3 - 1222..

  • For each file f, calculate P(f) = (commits containing f) / (total commits).
  • For each pair of files f1, f2, calculate Q(f1, f2) = (commits containing both f1 and f2) / (total commits).
  • For each pair of files f1, f2, calculate D(f1, f2) = P(f1) * P(f2) / [Q(f1, f2) - P(f1) * P(f2)], or infinity if Q(f1, f2) <= P(f1) * P(f2).

After following the steps above, I now have pairs of files and their D(f1, f2) values, which look like this:

    two_pair_list = [['file1', 'file2'], ['file1', 'file3']...['file49', 'file50']]

    d_value = [3.2, -1, 0.12, 7.6, -1, ...]

I've used -1 as the d_value when Q(f1, f2) <= P(f1) * P(f2). For example, since no commit in the db contains both file1 and file3 (i.e. Q(file1, file3) = 0), its d_value is -1.

Now, given the d_value list for the pairs of files, how can I perform hierarchical clustering to see which files are correlated? I believe SciPy's linkage() API will help, but I'm not sure how to use it with this data. Any help is appreciated. Thanks.
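For reference, the three steps described in the question (P, Q and D) could be computed roughly as follows. This is only a sketch under assumptions: the `rows` data is made up for illustration, `pairwise_d` is a hypothetical helper name, and the rows are assumed to arrive as (commit, file) tuples like the table at the top of the question. It returns infinity rather than -1 for pairs with Q(f1, f2) <= P(f1) * P(f2).

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample rows in the (commit, file) shape shown in the question.
rows = [
    ("sha1", "file1"), ("sha1", "file2"),
    ("sha2", "file2"), ("sha2", "file3"),
    ("sha3", "file1"), ("sha3", "file2"),
    ("sha4", "file3"),
]


def pairwise_d(rows):
    """Return (pairs, d_values) following the P, Q, D definitions above."""
    # Group the files touched by each commit.
    files_by_commit = defaultdict(set)
    for commit, path in rows:
        files_by_commit[commit].add(path)

    total = len(files_by_commit)
    files = sorted({path for _, path in rows})

    # P(f) = commits containing f / total commits
    p = {
        f: sum(f in touched for touched in files_by_commit.values()) / total
        for f in files
    }

    pairs, d_values = [], []
    for f1, f2 in combinations(files, 2):
        # Q(f1, f2) = commits containing both f1 and f2 / total commits
        q = sum(
            f1 in touched and f2 in touched
            for touched in files_by_commit.values()
        ) / total
        excess = q - p[f1] * p[f2]
        # D is infinite when the pair co-occurs no more often than chance.
        d = p[f1] * p[f2] / excess if excess > 0 else float("inf")
        pairs.append([f1, f2])
        d_values.append(d)
    return pairs, d_values


two_pair_list, d_value = pairwise_d(rows)
```

With the sample rows above, file1 and file2 get D = 3.0, while the other two pairs get infinity because they appear together no more often than chance would predict.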


Solution

  • A simple example:

    from scipy.cluster.hierarchy import dendrogram, linkage
    import numpy as np
    from matplotlib import pyplot as plt
    
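    # Condensed distance vector: 6 pairwise distances, i.e. 4 files
    d_value = np.array([3.2, 100, 0.12, 7.6, 100, 5.2])
    Z = linkage(d_value, 'ward')  # pass the distances defined above
    fig = plt.figure()
    dn = dendrogram(Z)
    plt.show()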
    

    The result:

    [image: dendrogram of the clustering]

    Note that I've changed your -1 into 100, since the distance between file1 and file3 should be large when they have never been committed together.
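If it helps, here is a slightly fuller sketch that starts from `two_pair_list` and `d_value` in the form given in the question, so the file names stay attached to the result. Several things here are assumptions rather than part of the answer above: the four file names and their pairing with the six distances, the choice of ten times the largest real distance as the stand-in for -1, and the switch from `'ward'` to `'average'` linkage (SciPy documents Ward as correctly defined only for Euclidean distances, which D is not guaranteed to be). The cut threshold `t=1.0` is likewise arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Illustrative input in the question's format (-1 marks "never together").
two_pair_list = [
    ["file1", "file2"], ["file1", "file3"], ["file1", "file4"],
    ["file2", "file3"], ["file2", "file4"], ["file3", "file4"],
]
d_value = [3.2, -1, 0.12, 7.6, -1, 5.2]

# Map each file name to a row/column index.
labels = sorted({name for pair in two_pair_list for name in pair})
index = {name: i for i, name in enumerate(labels)}

# Assumption: replace -1 with a value well above every real distance.
far = 10 * max(d for d in d_value if d >= 0)

# Fill a symmetric distance matrix with a zero diagonal.
dist = np.zeros((len(labels), len(labels)))
for (f1, f2), d in zip(two_pair_list, d_value):
    i, j = index[f1], index[f2]
    dist[i, j] = dist[j, i] = d if d >= 0 else far

# squareform converts the square matrix to the condensed vector linkage expects.
condensed = squareform(dist)
Z = linkage(condensed, method="average")

# Cut the tree: files closer than the threshold end up in the same cluster.
clusters = fcluster(Z, t=1.0, criterion="distance")
for name, cluster_id in zip(labels, clusters):
    print(name, cluster_id)
```

Building the square matrix first means the order of `two_pair_list` doesn't matter. To plot the tree with file names on the axis, `dendrogram(Z, labels=labels)` can be used in place of the plain `dendrogram(Z)` call.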