python arrays pandas nlp cosine-similarity

Export Cosine Similarity Array as a Matrix with Labels


Short version: I have an array and need to turn it into a matrix with name labels along the top and side, then export it to CSV like the example below. (Sorry if my wording is incorrect.)

Long version: I'm self-taught and built a recommendation system, and I have a website ready after a year in quarantine spent learning and troubleshooting here on SO. Usually I figure things out after a few days of searching, but this one has had me stuck for about 3 weeks now.

The recommendation system works in Python: I can put in a name and it spits out the recommended names. I tweaked it until the results were acceptable. But none of the books, websites, tutorials, Udemy classes, etc. taught me how to take the Python code and get it working in a Django site.

This is what the output currently looks like:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# instantiating and generating the count matrix
count = CountVectorizer()
count_matrix = count.fit_transform(df['bag_of_words'])

# creating a Series of the names so they are associated with an ordered numerical
# list I will use later to match the indexes
indices = pd.Series(df.index)
indices[:5]

0             ZZ Top
1         Zyan Malik
2    Zooey Deschanel
3       Ziggy Marley
4                ZHU
Name: name, dtype: object

# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim

array([[1.        , 0.11708208, 0.10192614, ..., 0.        , 0.        ,
       0.        ],
      [0.11708208, 1.        , 0.1682581 , ..., 0.        , 0.        ,
       0.        ],
      [0.10192614, 0.1682581 , 1.        , ..., 0.        , 0.        ,
       0.        ],
      ...,
      [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
       1.        ],
      [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
       1.        ],
      [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
       1.        ]])
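
To make the shape of this output concrete, here is a minimal sketch that builds the same kind of matrix by hand with NumPy. The three names and the tiny count matrix are made up for illustration (they are not from the dataset above), and the calculation is the textbook cosine formula, not necessarily how scikit-learn implements it internally.

```python
import numpy as np

# Hypothetical stand-ins for df.index and the count matrix (not the real data).
names = ["Artist A", "Artist B", "Artist C"]
counts = np.array([
    [1, 1, 0, 0],   # word counts for Artist A
    [1, 0, 1, 0],   # word counts for Artist B
    [0, 0, 0, 2],   # word counts for Artist C
], dtype=float)

# Cosine similarity: dot product of two rows divided by the product of their lengths.
norms = np.linalg.norm(counts, axis=1, keepdims=True)
unit_rows = counts / norms
toy_sim = unit_rows @ unit_rows.T

# Row i and column i both refer to names[i], so the matrix is square and
# symmetric, with 1.0 on the diagonal.
print(toy_sim.shape)
print(np.allclose(toy_sim, toy_sim.T))
```

The point that matters for the export is the ordering: row i and column i of the matrix correspond to the i-th name, so the same list of names can label both axes.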

# I then need to export to CSV, which I understand how to do

.to_csv('artist_similarities.csv')

Desired Exports

I am trying to get the array labelled with the index names, in what I think is called a matrix, like this example.

What the exported CSV should look like:

     scores            ZZ Top        Zyan Malik    Zooey Deschanel   ZHU
0    ZZ Top            0             65.61249881   24.04163056       24.06241883
1    Zyan Malik        65.61249881   0             89.35882721       69.6634768
2    Zooey Deschanel   24.04163056   89.40917179   0                 20.09975124
3    ZHU               7.874007874   69.6634768    20.09975124       0

# function that takes in a name as input and returns the top 10 recommended names
def recommendations(name, cosine_sim=cosine_sim):
    recommended_names = []

    # getting the positional index of the row that matches the name
    idx = indices[indices == name].index[0]

    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)

    # getting the indexes of the 10 most similar names
    # (position 0 is skipped because it is normally the name itself)
    top_10_indexes = list(score_series.iloc[1:11].index)

    # populating the list with the 10 best-matching names
    for i in top_10_indexes:
        recommended_names.append(df.index[i])

    return recommended_names

# working results, which are pretty good for this dataset

recommendations('Blues Traveler')

['G-Love & The Special Sauce',
 'Phish',
 'Spin Doctors',
 'Grace Potter and the Nocturnals',
 'Jason Mraz',
 'Pearl Jam',
 'Dave Matthews Band',
 'Lukas Nelson & Promise of the Real ',
 'Vonda Shepard',
 'Goo Goo Dolls']

Solution

  • I'm not sure I understand what you're asking, and I can't comment, so I'm forced to write here. I assume you want to add column and index labels to the cosine_sim array. You could do something like this:

    cos_sim_df = pd.DataFrame(cosine_sim, index=indices, columns=indices)
    cos_sim_df.to_csv("artist_similarities.csv")
    

    And then read the CSV back like this:

    cos_sim_df = pd.read_csv("artist_similarities.csv", header=0, index_col=0)
    

    Passing header=0 and index_col=0 makes sure pandas treats the first row and first column as labels. I also assumed your column and row labels are the same; you can change them if you need to. One more thing: this won't look exactly like the desired export, because that CSV has a "score" field containing the artist names, even though it seems the artists should be the field names. If you want the exported CSV to look exactly like the desired export, you can add the artists in a "score" field like this:

    cos_sim_df = pd.DataFrame(cosine_sim, columns=indices)
    cos_sim_df["score"] = indices
    # make the score field the first field
    cos_sim_df = cos_sim_df[["score", *indices]]
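
    Both layouts can be tried end to end on a toy matrix. The following is only a sketch under stated assumptions: the three names and scores are invented, the CSV is written to an in-memory buffer instead of artist_similarities.csv, and DataFrame.insert is used as an alternative way to put the "score" field first.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for `indices` and `cosine_sim` (not the real data).
indices = pd.Series(["Artist A", "Artist B", "Artist C"], name="name")
cosine_sim = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
])

# Variant 1: names as both the row index and the column headers.
labelled = pd.DataFrame(cosine_sim, index=indices.values, columns=indices.values)

# Write to an in-memory buffer so the sketch leaves no file behind.
buffer = io.StringIO()
labelled.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, header=0, index_col=0)

# Variant 2: names in an ordinary first column called "score".
with_score = pd.DataFrame(cosine_sim, columns=indices.values)
with_score.insert(0, "score", indices.values)

print(restored)
print(with_score)
```

    With variant 2, passing index=False to to_csv would leave the numeric row index out of the file, if it is not wanted.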
    

    Lastly, note that cos_sim_df["Zyan Malik"] selects a column, while cos_sim_df.loc["Zyan Malik"] selects a row, and it seems you pictured the names as column labels. In this specific case your array is symmetric across the diagonal, so it doesn't matter which axis you index: both return the same values. Keep the distinction in mind if your array isn't symmetric.
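
    The row-versus-column point can be checked directly on a small labelled frame. This is a minimal sketch with invented names and scores; top_matches is a hypothetical helper, not part of pandas, showing how the top-N lookup could be done by label once the matrix has names on both axes.

```python
import pandas as pd

# Hypothetical labels and scores, for illustration only.
names = ["Artist A", "Artist B", "Artist C"]
symmetric = pd.DataFrame(
    [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.2],
     [0.0, 0.2, 1.0]],
    index=names, columns=names,
)

# df[label] selects a column; df.loc[label] selects a row.
column = symmetric["Artist B"]
row = symmetric.loc["Artist B"]
print(column.tolist() == row.tolist())  # same values: the matrix is symmetric

# With a non-symmetric matrix the two lookups differ.
lopsided = symmetric.copy()
lopsided.loc["Artist B", "Artist C"] = 0.9
print(lopsided["Artist B"].tolist() == lopsided.loc["Artist B"].tolist())


def top_matches(sim_df, name, n=10):
    """Hypothetical helper: the n labels most similar to `name`, best first."""
    return sim_df.loc[name].drop(name).nlargest(n).index.tolist()


print(top_matches(symmetric, "Artist B", n=2))
```

    Dropping the queried name by label avoids relying on it being sorted into first position, which may not hold when several rows have a similarity of exactly 1.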