Tags: python, python-3.x, pandas, scikit-learn, cosine-similarity

Python sklearn cosine-similarity loop for all records


I have a DataFrame named df. I'm using the code below to get the cosine similarity for each row:

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(df['name']).todense()
for f in features:
    for index, row in df.iterrows():
        df['index'+str(index)] = pd.DataFrame(cosine_similarity(features,f))
df

But the output DataFrame shows the same values in every column, which I assume are the similarities to the last record:

   name                                   index0     index1    index2     index3       index4
0   aaaabbbbbbcccc                     0.158114  0.158114   0.158114    0.158114    0.158114
1   ddddffffffgggg                     0.204124  0.204124   0.204124    0.204124    0.204124
2   hhhhhhiiiiiijjjjj                  0.158114  0.158114   0.158114    0.158114    0.158114
3   kkkkkklllllllmmmm                  0.235702  0.235702   0.235702    0.235702    0.235702
4   mmmmmnnnnnnooooooo                 1.000000  1.000000   1.000000    1.000000    1.000000

I want a separate result column for every record.


Solution

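  • IIUC the inner loop is the problem: for every f it rewrites all of the index columns, so after the last iteration every column holds the similarities to the last record. You simply need a single loop: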

    for i, f in enumerate(features):
        df['index'+str(i)] = pd.DataFrame(cosine_similarity(features,f))
    df
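
As an alternative to looping, cosine_similarity can be called once with only the feature matrix, in which case it returns the full pairwise matrix (row i, column j is the similarity between records i and j). The sketch below assumes made-up sample names, since the real contents of df['name'] are not shown, and it keeps the features sparse instead of calling .todense(), partly because recent scikit-learn releases may reject the np.matrix that .todense() returns. The names sim, sim_df and result are illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample data standing in for the asker's df['name'] column.
df = pd.DataFrame({"name": ["red apple", "green apple", "red car", "blue car"]})

vectorizer = CountVectorizer()
# Keep the sparse matrix; cosine_similarity accepts it directly.
features = vectorizer.fit_transform(df["name"])

# One call gives the (n_records, n_records) pairwise similarity matrix.
sim = cosine_similarity(features)

# Column "index<i>" holds the similarity of every record to record i.
sim_df = pd.DataFrame(
    sim,
    index=df.index,
    columns=["index" + str(i) for i in range(len(df))],
)
result = pd.concat([df, sim_df], axis=1)
print(result)
```

Because the matrix is symmetric, column index2 and row 2 carry the same values, and the diagonal is 1 (each record compared with itself).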