I have the following Python notebook, which aims to cluster different groups of abstracts based on the similarity between their text. I have two approaches: the first is to feed the TF-IDF numpy array of the documents directly into the linkage function; the second is to compute the similarity between the TF-IDF arrays of the documents and then use that similarity matrix for clustering. I am unable to work out which one is correct.
Approach 2 (similarity matrix): I used cosine_similarity to compute the similarity matrix (cosine) of the TF-IDF matrix. I first converted that redundant square matrix (cosine) into a condensed distance matrix (distance_matrix) using the squareform function. Then distance_matrix was fed into the linkage function, and I plotted the graph using dendrograms.
Approach 1 (TF-IDF directly): I fed the TF-IDF numpy array (converted to dense form) into the linkage function and plotted the dendrograms.
My question is: which of these is correct? From the data, as far as I can understand, approach 2 seems to be correct, but to me approach 1 makes sense. It would be great if someone could explain what is right in this scenario. Thanks in advance.
Let me know if anything remains unclear in the question.
import pandas, numpy
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
###Data Cleaning
stop_words = stopwords.words('english')
tokenizer = RegexpTokenizer(r'\w+')
df = pandas.read_csv('WIPO_CSV.csv')

# Python 2 workaround for unicode errors while handling the text
import sys
reload(sys)
sys.setdefaultencoding('utf8')
documents_no_stopwords=[]
def preprocessing(word):
    tokens = tokenizer.tokenize(word)
    processed_words = []
    for w in tokens:
        if w not in stop_words:
            processed_words.append(w)
    # This step builds a list of text documents with the stop words removed
    documents_no_stopwords.append(' '.join(processed_words))

for text in df['TEXT'].tolist():
    preprocessing(text)
### Converting into TF-IDF form
# latin1 is used because the utf8 decoder was having trouble with the text
vectoriser = TfidfVectorizer(encoding='latin1')

# tfidf_documents is a sparse matrix with L2-normalised rows
tfidf_documents = vectoriser.fit_transform(documents_no_stopwords)
## Cosine similarity, as the input to linkage should be a distance vector
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import squareform

cosine = cosine_similarity(tfidf_documents)
distance_matrix = squareform(cosine, force='tovector', checks=False)
from scipy.cluster.hierarchy import dendrogram, linkage

## Linkage based on the TF-IDF vector of each document
z_num = linkage(tfidf_documents.todense(), 'ward')
z_num  # tfidf
array([[11. , 12. , 0. , 2. ],
[18. , 19. , 0. , 2. ],
[20. , 31. , 0. , 3. ],
[21. , 32. , 0. , 4. ],
[22. , 33. , 0. , 5. ],
[17. , 34. , 0.38208619, 6. ],
[15. , 28. , 1.19375843, 2. ],
[ 6. , 9. , 1.24241258, 2. ],
[ 7. , 8. , 1.27069483, 2. ],
[13. , 37. , 1.28868301, 3. ],
[ 4. , 24. , 1.30850122, 2. ],
[36. , 39. , 1.32090275, 5. ],
[10. , 16. , 1.32602346, 2. ],
[27. , 38. , 1.32934025, 3. ],
[23. , 25. , 1.32987072, 2. ],
[ 3. , 29. , 1.35143582, 2. ],
[ 5. , 14. , 1.35401753, 2. ],
[26. , 42. , 1.35994878, 3. ],
[ 2. , 45. , 1.40055438, 3. ],
[ 0. , 40. , 1.40811825, 3. ],
[ 1. , 46. , 1.41383622, 3. ],
[44. , 50. , 1.4379821 , 5. ],
[41. , 43. , 1.44575227, 8. ],
[48. , 51. , 1.45876241, 8. ],
[49. , 53. , 1.47130328, 11. ],
[47. , 52. , 1.49944936, 11. ],
[54. , 55. , 1.69814818, 22. ],
[30. , 56. , 1.91299937, 24. ],
[35. , 57. , 3.1967033 , 30. ]])
from matplotlib import pyplot as plt
plt.figure(figsize=(25, 10))
dn = dendrogram(z_num)
plt.show()
z_sim = linkage(distance_matrix, 'ward')
z_sim  # cosine similarity
array([[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 2.00000000e+00],
[2.00000000e+00, 3.00000000e+01, 0.00000000e+00, 3.00000000e+00],
[1.70000000e+01, 3.10000000e+01, 0.00000000e+00, 4.00000000e+00],
[3.00000000e+00, 4.00000000e+00, 0.00000000e+00, 2.00000000e+00],
[1.00000000e+01, 3.30000000e+01, 0.00000000e+00, 3.00000000e+00],
[5.00000000e+00, 7.00000000e+00, 0.00000000e+00, 2.00000000e+00],
[6.00000000e+00, 1.80000000e+01, 0.00000000e+00, 2.00000000e+00],
[1.10000000e+01, 1.90000000e+01, 0.00000000e+00, 2.00000000e+00],
[1.20000000e+01, 2.00000000e+01, 0.00000000e+00, 2.00000000e+00],
[8.00000000e+00, 2.40000000e+01, 0.00000000e+00, 2.00000000e+00],
[1.60000000e+01, 2.10000000e+01, 0.00000000e+00, 2.00000000e+00],
[2.20000000e+01, 2.70000000e+01, 0.00000000e+00, 2.00000000e+00],
[9.00000000e+00, 2.90000000e+01, 0.00000000e+00, 2.00000000e+00],
[2.60000000e+01, 4.20000000e+01, 0.00000000e+00, 3.00000000e+00],
[1.40000000e+01, 3.40000000e+01, 3.97089886e-03, 4.00000000e+00],
[2.30000000e+01, 4.40000000e+01, 1.81733052e-02, 5.00000000e+00],
[3.20000000e+01, 3.50000000e+01, 2.14592323e-02, 6.00000000e+00],
[2.50000000e+01, 4.00000000e+01, 2.84944415e-02, 3.00000000e+00],
[1.30000000e+01, 4.70000000e+01, 5.02045376e-02, 4.00000000e+00],
[4.10000000e+01, 4.30000000e+01, 5.10902795e-02, 5.00000000e+00],
[3.70000000e+01, 4.50000000e+01, 5.40176402e-02, 7.00000000e+00],
[3.80000000e+01, 3.90000000e+01, 6.15118462e-02, 4.00000000e+00],
[1.50000000e+01, 4.60000000e+01, 7.54874869e-02, 7.00000000e+00],
[2.80000000e+01, 5.00000000e+01, 9.55487454e-02, 8.00000000e+00],
[5.20000000e+01, 5.30000000e+01, 3.86911095e-01, 1.50000000e+01],
[4.90000000e+01, 5.40000000e+01, 4.16693529e-01, 2.00000000e+01],
[4.80000000e+01, 5.50000000e+01, 4.58764920e-01, 2.40000000e+01],
[3.60000000e+01, 5.60000000e+01, 5.23422380e-01, 2.60000000e+01],
[5.10000000e+01, 5.70000000e+01, 5.49419077e-01, 3.00000000e+01]])
from matplotlib import pyplot as plt
plt.figure(figsize=(25, 10))
dn = dendrogram(z_sim)
plt.show()
The accuracy of the clustering is compared with this photo: https://drive.google.com/file/d/1EgkPqwh7AKhGqOe1zf9KNjSMxPQ9Xfd9/view?usp=sharing
The dendrograms that I got are available in the following notebook link: https://drive.google.com/file/d/1TB7aFK4lPDo43GY74FPOqVOx1AxWV-A_/view?usp=sharing (open this HTML file using a web browser).
Scipy only supports distances for HAC, not similarities, so you need to convert your similarities into distances. Once you do, the results should be the same, so there is no "right" or "wrong".
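A minimal sketch of that conversion (not from the original post), reusing cosine from the question; 'average' linkage is used here rather than 'ward', since 'ward' formally assumes Euclidean distances:

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# distance = 1 - similarity; clip tiny negatives introduced by floating point
cosine_distance = np.clip(1.0 - cosine, 0.0, None)

# condensed (linearized) vector expected by linkage
condensed = squareform(cosine_distance, checks=False)

z = linkage(condensed, 'average')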
At some point you need the distance matrix in linearized (condensed) form. It is probably most efficient to use a) a method that can process sparse data (avoiding any todense call), and b) one that directly produces the linearized form, rather than generating the entire square matrix and then dropping half of it.