I've got a database that contains information about commits made to a repo. For example:
commit-sha1 | file1 |
commit-sha1 | file2 |
commit-sha2 | file2 |
commit-sha2 | file3 |
and so on. Basically, this shows that sha1 changed files (file1, file2) and sha2 changed files (file2, file3). Now I want to see whether some files are correlated, i.e. what the chances are that file1 and file2 are committed together, etc. For this, I first found the top 50 most commonly committed files, which gave me:
file1 - 1500
file2 - 1423
file3 - 1222..
Then, for each pair of files f1, f2, I calculated
D(f1, f2) = P(f1) * P(f2) / [Q(f1, f2) - P(f1) * P(f2)]
or infinity if Q(f1, f2) <= P(f1) * P(f2). After following the above, I now have pairs of files and their D(f1, f2) values, which look like this:
two_pair_list = [['file1', 'file2'], ['file1', 'file3']...['file49', 'file50']]
d_value = [3.2, -1, 0.12, 7.6, -1, ...]
I've used -1 as the d_value when Q(f1, f2) <= P(f1) * P(f2). For example, since no commits in the db contain both file1 and file3 (i.e. Q(file1, file3) = 0), their d_value is -1. Now, given the d_value list for pairs of files, how can I perform hierarchical clustering to see which files are correlated? I believe Python's linkage() API will help, but I'm not sure how to use it with this data. Any help is appreciated. Thanks.
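For reference, here is a minimal sketch of how the D(f1, f2) values might be computed from the (commit, file) rows. It assumes P(f) is the fraction of commits that touch f and Q(f1, f2) is the fraction of commits that touch both files, which the question does not state explicitly. The function name pair_distances and the sample rows are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations
import math


def pair_distances(rows):
    """Return {(f1, f2): D(f1, f2)} for rows of (commit_sha, file_name).

    Assumption: P and Q are relative frequencies over all commits.
    """
    files_by_commit = defaultdict(set)
    for sha, name in rows:
        files_by_commit[sha].add(name)

    commits = list(files_by_commit.values())
    n = len(commits)
    files = sorted({f for touched in commits for f in touched})

    # P(f): fraction of commits that touch file f
    p = {f: sum(f in touched for touched in commits) / n for f in files}

    distances = {}
    for f1, f2 in combinations(files, 2):
        # Q(f1, f2): fraction of commits that touch both files
        q = sum(f1 in touched and f2 in touched for touched in commits) / n
        excess = q - p[f1] * p[f2]
        # Infinity when the files co-occur no more often than chance predicts
        distances[(f1, f2)] = p[f1] * p[f2] / excess if excess > 0 else math.inf
    return distances


# Hypothetical sample data, not taken from the real database
rows = [
    ("sha1", "file1"), ("sha1", "file2"),
    ("sha2", "file2"), ("sha2", "file3"),
    ("sha3", "file1"), ("sha3", "file2"),
    ("sha4", "file3"),
]
d = pair_distances(rows)
print(d[("file1", "file2")])  # 3.0
```

Using math.inf instead of -1 keeps "never more often than chance" pairs from being mistaken for small distances; they still need to be replaced by a large finite number before clustering.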
A simple example:
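from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np
from matplotlib import pyplot as plt
# Condensed distance vector: the 6 pairwise distances among 4 files,
# in the order (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
d_value = np.array([3.2, 100, 0.12, 7.6, 100, 5.2])
Z = linkage(d_value, 'ward')  # 1-D input is treated as condensed distances
fig = plt.figure()
dn = dendrogram(Z)
plt.show()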
The result (dendrogram plot):
Note that I've changed your -1 into 100, since the distance between file1 and file3 should be large when they have never been committed together.
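To go from named pairs to actual groups of files, something like the following sketch may help. It builds a square distance matrix from two_pair_list and d_value, converts it to the condensed form that linkage() expects, and cuts the tree with fcluster(). The four-file data, the "10 times the largest finite distance" placeholder, and the cut threshold of 10.0 are all assumptions chosen for illustration. It also uses 'average' linkage rather than 'ward', because the SciPy documentation says 'ward' is correctly defined only for Euclidean distances, and D(f1, f2) is not obviously Euclidean.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical data for four files, in the same shape as in the question
files = ["file1", "file2", "file3", "file4"]
two_pair_list = [
    ["file1", "file2"], ["file1", "file3"], ["file1", "file4"],
    ["file2", "file3"], ["file2", "file4"], ["file3", "file4"],
]
d_value = [3.2, -1, 0.12, 7.6, -1, 5.2]  # -1 marks "infinite" distance

index = {name: i for i, name in enumerate(files)}

# Assumption: replace -1 with a finite value well above every real distance
far = 10 * max(d for d in d_value if d >= 0)

# Fill a symmetric matrix so the result does not depend on pair order
dist = np.zeros((len(files), len(files)))
for (f1, f2), d in zip(two_pair_list, d_value):
    i, j = index[f1], index[f2]
    dist[i, j] = dist[j, i] = far if d < 0 else d

condensed = squareform(dist)  # square matrix -> condensed 1-D vector
Z = linkage(condensed, method="average")

# Cut the tree: files closer than the threshold share a cluster label
labels = fcluster(Z, t=10.0, criterion="distance")

clusters = {}
for name, label in zip(files, labels):
    clusters.setdefault(int(label), []).append(name)
print(clusters)
```

With this data, file1 and file4 end up in one cluster and file2 and file3 in another. The threshold controls how aggressively files are grouped, so it is worth inspecting the dendrogram before settling on a value.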