Tags: python, pandas, gensim, word2vec, similarity

Text similarity using WMD within the same time period


I have a dataset

    Title                                                                    Year
0   Sport, there will be a match between United and Tottenham ...            2020
1   Forecasting says that it will be cold next week                          2019
2   Sport, Mourinho is approaching the anniversary at Tottenham              2020
3   Sport, Tottenham are sixth favourites for the title behind Arsenal.      2020
4   Pochettino says clear-out of fringe players at Tottenham is inevitable.  2018
... ... ...

I would like to study text similarity within the same year rather than across the whole dataset. To find the most similar texts, I am using the Word Mover's Distance (WMD). For two texts, it is computed as:

import gensim

word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
distance = word2vec_model.wmdistance("string 1".split(), "string 2".split())

However, I need to apply this distance to every pair of texts published in the same year, so that each row of the dataframe gets a list of its most similar texts. How can I iterate the wmdistance function over texts from the same year to find, for each text, the most similar ones within that period?


Solution

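  • Generating a distance matrix for each year group and then picking the smallest off-diagonal value in every row should work. This gives you the index of the single nearest document within a given year; a title that is the only one in its year has no neighbour and gets its own index. You should be able to modify this code quite easily if you want the n nearest documents or something similar.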

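    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def nearest_doc(group):
        # Pairwise WMD between all titles of one year; wmdistance expects
        # lists of tokens, so each title is split into words first.
        docs = group.to_numpy()[:, None]
        sq = squareform(pdist(docs, metric=lambda x, y: word2vec_model.wmdistance(x[0].split(), y[0].split())))

        # Mask zero distances (the diagonal, i.e. each title against itself)
        # so that argmin returns the closest other title in the group.
        return group.index.to_numpy()[np.argmin(np.where(sq == 0, np.inf, sq), axis=1)]

    # One entry per row: the index label of the nearest title from the same year.
    df['nearest_doc'] = df.groupby('Year')['Title'].transform(nearest_doc)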
    

    result:

    Title   Year    nearest_doc
    0   Sport, there will be a match between United an...   2020    3
    1   Forecasting says that it will be cold next week     2019    1
    2   Sport, Mourinho is approaching the anniversary...   2020    3
    3   Sport, Tottenham are sixth favourites for the ...   2020    2
    4   Pochettino says clear-out of fringe players at...   2018    4
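
For the "n documents" variant mentioned in the answer, the following is a minimal sketch rather than a drop-in replacement. The helper name nearest_in_group, its parameters, the sample titles, and the jaccard_distance stand-in are all assumptions made for illustration; the stand-in metric exists only so the example runs without the GoogleNews vectors.

```python
from collections import defaultdict


def nearest_in_group(titles, years, distance, n=2):
    """For each title, return positions of its n closest titles from the same year."""
    # Collect row positions per year so comparisons never cross years.
    by_year = defaultdict(list)
    for pos, year in enumerate(years):
        by_year[year].append(pos)

    # Tokenize once; a WMD-style metric compares lists of words.
    tokens = [title.split() for title in titles]

    nearest = [[] for _ in titles]
    for members in by_year.values():
        for i in members:
            # Distance from title i to every other title of that year.
            scored = [(distance(tokens[i], tokens[j]), j) for j in members if j != i]
            scored.sort()  # smallest distance first, ties broken by position
            nearest[i] = [j for _, j in scored[:n]]
    return nearest


def jaccard_distance(a, b):
    """Hypothetical stand-in for wmdistance: 1 - |A & B| / |A | B| on word sets."""
    a, b = set(a), set(b)
    if not (a | b):
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# Illustrative titles (not the dataset from the question).
titles = [
    "tottenham play united on sunday",
    "cold weather expected next week",
    "tottenham play arsenal on sunday",
    "united sign a new striker",
    "heavy snow expected next week",
]
years = [2020, 2019, 2020, 2020, 2019]

result = nearest_in_group(titles, years, jaccard_distance, n=2)
print(result)
```

With the real model, passing word2vec_model.wmdistance as distance, together with df['Title'].tolist() and df['Year'].tolist(), should give the positions of the n closest titles for each row. A title that is alone in its year gets an empty list here instead of its own index.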