Tags: python, python-3.x, vector, cluster-analysis, similarity

Understanding np.zeros in clustering


I'm learning about clustering, and several tutorials do something I don't quite understand in the similarity-measure step:

tfidf_vector = TfidfVectorizer()
tfidf_matrix = tfidf_vector.fit_transform(file)

#and/or

count_vector = CountVectorizer()
count_matrix = count_vector.fit_transform(file)

#AND HERE
file_size = len(file)
x = np.zeros((file_size, file_size))
#and here the similarity measures like cosine_similarity, jaccard...

for elm in range(file_size):
    x[elm] = cosine_similarity(tfidf_matrix[i:i+1], tfidf_matrix)

y = np.subtract(np.ones((file_size, file_size), dtype=float), x)

new_file = np.asarray(y)
w = new_file.reshape((1,file_size,file_size))

Why do we need np.zeros? Isn't tfidf_matrix/count_matrix sufficient for similarity measures?


Solution

  • This code does the same thing (I changed i to elm, since it looks like a typo):

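    x = []
    for elm in range(file_size):
        # cosine_similarity returns a (1, file_size) array; [0] takes that single row
        x.append(cosine_similarity(tfidf_matrix[elm:elm+1], tfidf_matrix)[0])
    x = np.asarray(x)  # shape (file_size, file_size), same as the np.zeros version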
    

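    You could also have replaced np.zeros with np.empty, because the loop overwrites every row anyway. np.zeros does not compute anything: it only allocates the file_size × file_size array that will hold the results, while tfidf_matrix is just the input. Creating the array beforehand and then filling in every element is slightly more efficient than appending to a list and converting it to a numpy array afterwards. Numpy arrays have a fixed size, and many other programming languages likewise require arrays to be preallocated, which is why many people fill an array this way.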

    However, since this is Python, you should do whatever is easiest for you and others to read.
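
As a rough sketch of the two approaches side by side, the following fills a preallocated array row by row and then compares the result with calling cosine_similarity once on the whole matrix, which scikit-learn also supports. The three-sentence corpus docs and the names sim_loop, sim_all and dist are made up for illustration; they are not from the tutorials in the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for `file` from the question.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "clustering groups similar documents",
]

tfidf_matrix = TfidfVectorizer().fit_transform(docs)
n_docs = tfidf_matrix.shape[0]  # number of documents (rows)

# Variant 1: preallocate the (n_docs, n_docs) result, then fill one row per iteration.
sim_loop = np.zeros((n_docs, n_docs))
for row in range(n_docs):
    # cosine_similarity returns shape (1, n_docs); [0] takes that single row.
    sim_loop[row] = cosine_similarity(tfidf_matrix[row:row + 1], tfidf_matrix)[0]

# Variant 2: with one argument, cosine_similarity compares every row with every row.
sim_all = cosine_similarity(tfidf_matrix)

# Distance = 1 - similarity; NumPy broadcasts the scalar, so np.ones is not needed.
dist = 1.0 - sim_all
```

If both variants give the same matrix, as they should here, the single call is usually the simpler choice and no explicit np.zeros is needed.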