Tags: python, pandas, cosine-similarity

Row-wise cosine similarity calculation in pandas


I want to calculate the cosine similarity between each row and the row before it. The dataframe is already sorted by id and date.

I looked at the existing solutions on Stack Overflow, but the use case is a bit different in each of them. I have many more features, around 32 in total, and I want to use all of those feature columns (Paths modified, tags modified and endpoints added are examples of some features in my dataframe) to calculate the metric for each row.

This is what I could come up with, but it does not do what I need:

df = pd.DataFrame(np.random.randint(0, 5, (3, 5)), columns=['id', 'date', 'feature1', 'feature2', 'feature3'])

similarity_df = df.iloc[:, 2:].apply(lambda x: cosine_similarity([x], df.iloc[:, 2:])[0], axis=1)

Does anyone have suggestions on how I could proceed with this?


Solution

  • I was able to figure it out. A loop over each api_spec_id was what I was looking for: without it, some api_spec_ids were not getting NaN for their first row, and a similarity was being calculated there, which is wrong.

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    
    # Feature columns to use for cosine similarity calculation
    cols_to_use = labels.loc[:, "Info_contact_name_changes":"Paths_modified"].columns
    
    # New column for cosine similarity
    labels['cosine_sim'] = np.nan
    
    # Looping through each api_spec_id
    for api_spec_id in labels['api_spec_id'].unique():
        # Get the rows for the current api_spec_id
        api_rows = labels[labels['api_spec_id'] == api_spec_id].sort_values(by='commit_date')
    
        # Set the cosine similarity of the first row to NaN, since there is no previous row to compare to
        labels.loc[api_rows.index[0], 'cosine_sim'] = np.nan
        
        # Calculate the cosine similarity for consecutive rows
        for i in range(1, len(api_rows)):
            # Get the previous and current row
            prev_row = api_rows.iloc[i-1][cols_to_use]
            curr_row = api_rows.iloc[i][cols_to_use]
            
            # Calculate the cosine similarity and store it in the 'cosine_sim' column
            cosine_sim = cosine_similarity([prev_row], [curr_row])[0][0]
            labels.loc[api_rows.index[i], 'cosine_sim'] = cosine_sim
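
The same per-group comparison of consecutive rows can also be sketched without an explicit Python loop, which may help on larger dataframes. The snippet below is only an illustration under assumptions: the toy data, the `feature1`/`feature2` columns and the `feature_cols` name are made up, and the cosine similarity is computed by hand with pandas and NumPy instead of `sklearn`'s `cosine_similarity`. It shifts the feature rows by one position within each `api_spec_id`, so the first row of every group is compared against NaN and stays NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names follow the answer above, values are made up.
labels = pd.DataFrame({
    "api_spec_id": [1, 1, 1, 2, 2],
    "commit_date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-01", "2021-01-02"]
    ),
    "feature1": [1, 2, 0, 3, 3],
    "feature2": [0, 0, 1, 4, 4],
})
feature_cols = ["feature1", "feature2"]  # assumed stand-in for the ~32 real features

# Make sure rows are ordered by id, then by date.
labels = labels.sort_values(["api_spec_id", "commit_date"])

# Current feature rows, and the previous row within the same api_spec_id.
# The first row of each group has no predecessor, so its shifted row is all NaN.
curr = labels[feature_cols].astype(float)
prev = curr.groupby(labels["api_spec_id"]).shift(1)

# Cosine similarity = dot product / (norm of current * norm of previous).
# skipna=False keeps NaN for rows without a previous row.
dot = (curr * prev).sum(axis=1, skipna=False)
norms = np.sqrt((curr ** 2).sum(axis=1)) * np.sqrt((prev ** 2).sum(axis=1, skipna=False))

# A zero norm would divide by zero, so it is mapped to NaN here.
labels["cosine_sim"] = dot / norms.replace(0, np.nan)

print(labels[["api_spec_id", "commit_date", "cosine_sim"]])
```

One difference to be aware of: for an all-zero feature row this sketch yields NaN, whereas `sklearn`'s `cosine_similarity` should, as far as I know, return 0 in that case. If the loop's behaviour is the one you want, a `fillna(0)` restricted to rows that do have a previous row would be one way to match it.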