Tags: r, vector, nlp, cosine-similarity

Calculate the cosine similarity of two words in R?


I have a text file and would like to create semantic vectors for each word in the file. I would then like to extract the cosine similarity for about 500 pairs of words. What is the best package in R for doing this?


Solution

  • If I understand your problem correctly, you want the cosine similarity between pairs of words. Let us start with the cosine similarity of just two words:

    library(stringdist)
    d <- stringdist("ca","abc",method="cosine")
    

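    The result is d = 0.1835034, as expected. Note that stringdist() returns a distance, so the corresponding cosine similarity is 1 - d. With the default q = 1, it is computed from the character counts of the two strings, not from their meaning.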
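
    To see where that number comes from, here is a minimal base-R sketch that rebuilds the same value by hand. It assumes single-character profiles (the package default, q = 1); cosine_char_dist is a hypothetical helper written for illustration and is not part of stringdist.

    ```r
    # Cosine distance between the character-count profiles of two strings.
    # cosine_char_dist is a hypothetical helper, not a stringdist function.
    cosine_char_dist <- function(a, b) {
      chars_a <- strsplit(a, "")[[1]]
      chars_b <- strsplit(b, "")[[1]]
      alphabet <- union(chars_a, chars_b)
      # Count how often each character of the joint alphabet occurs in each string.
      va <- as.numeric(table(factor(chars_a, levels = alphabet)))
      vb <- as.numeric(table(factor(chars_b, levels = alphabet)))
      # Distance = 1 - (dot product / product of the vector lengths).
      1 - sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
    }

    cosine_char_dist("ca", "abc")  # 1 - 2 / (sqrt(2) * sqrt(3))
    ```

    "ca" and "abc" share the characters a and c, so the dot product is 2 and the distance is 1 - 2 / sqrt(6), which matches the value above.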

    There is also a function stringdistmatrix() in that package, which calculates the distance between all pairs of strings. Without a method argument it uses the default edit distance ("osa") rather than the cosine distance:

    > d <- stringdistmatrix(c('foo','bar','boo','baz'))
    > d
      1 2 3
    2 3    
    3 1 2  
    4 3 1 2
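
    The matrix above uses the default method. As a sketch of how the same call might look with the cosine distance, the snippet below passes method = "cosine" and converts the result to a full, labelled matrix. The variable names words and m are only illustrative.

    ```r
    library(stringdist)

    words <- c("foo", "bar", "boo", "baz")

    # With a single argument, stringdistmatrix() returns a "dist" object
    # (lower triangle only); as.matrix() expands it to a square matrix.
    m <- as.matrix(stringdistmatrix(words, method = "cosine"))
    dimnames(m) <- list(words, words)

    # Convert distances to similarities.
    sim <- 1 - m
    round(sim, 3)
    ```

    Here "foo" and "bar" share no characters, so their similarity is 0, while "foo" and "boo" share both o's and come out at 0.8.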
    

    For your purpose, you can simply use something like this:

    stringdist(c("ca","abc"),c("aa","abc"),method="cosine")
    

    The result contains one distance per pair: the first value is the distance between ca and aa, the second is the distance between abc and abc:

    0.2928932 0.0000000
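
    For a longer list, such as the roughly 500 pairs mentioned in the question, one possible layout is a data frame with one pair per row. This is only a sketch: the data frame pairs and its column names are made up for illustration and would be replaced by your own data.

    ```r
    library(stringdist)

    # Hypothetical word pairs; in practice this would hold all of your pairs.
    pairs <- data.frame(
      word1 = c("ca", "abc"),
      word2 = c("aa", "abc"),
      stringsAsFactors = FALSE
    )

    # stringdist() is vectorised, so row i compares word1[i] with word2[i].
    pairs$cos_dist <- stringdist(pairs$word1, pairs$word2, method = "cosine")
    # Similarity is the complement of the distance.
    pairs$cos_sim <- 1 - pairs$cos_dist
    pairs
    ```

    Keeping the pairs and their scores in one data frame makes it easy to sort or filter by similarity afterwards.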
    

    Disclaimer: The library stringdist is brand new (June 2019), but seems to work nicely. I am not associated with the authors of the library.