Tags: r, list, cosine-similarity, doc2vec

How to convert the items of a list column into their own columns to find cosine similarity in R?


I have a data set that looks like this:

library(tidyverse)

data <- tibble(id = 1:10,
               vectors = list(rnorm(25)))

# A tibble: 10 x 2
      id vectors   
   <int> <list>    
 1     1 <dbl [25]>
 2     2 <dbl [25]>
 3     3 <dbl [25]>
 4     4 <dbl [25]>
 5     5 <dbl [25]>
 6     6 <dbl [25]>
 7     7 <dbl [25]>
 8     8 <dbl [25]>
 9     9 <dbl [25]>
10    10 <dbl [25]>

I'd like to use this data set to find cosine similarity, where each row represents a document. The cosine function from the lsa package seems like a good, easy way to do this; however, I would need each document represented as a column. I'd like to simply do data %>% t() to get my desired result, but that's not working. I've also tried "spreading" the list column first using unnest and spread, and I've tried flatten, all to no avail. The first line of my desired output would look something like:

  1    2    3    4    5    6    7    8    9    10
0.1  0.3  0.7  0.3  0.1  0.1  0.3  0.7  0.3  0.1

If there's a function from another package that handles data in this format, I would by all means just use that instead, though at this point I would also like to figure this out out of curiosity. I've looked at "R - list to data frame", but I'm not sure how to apply that to this situation.

The background to this is that I've performed doc2vec in Python with gensim, but due to our environment at work, if I want to build something interactive for a client, it would need to be in R.


Solution

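    library(dplyr)
    library(tidyr)

    data %>%
        # collapse each numeric vector into one comma-separated string
        mutate(vectors = sapply(vectors, function(x) paste(x, collapse = ","))) %>%
        # split the strings so each value gets its own row (long format)
        separate_rows(vectors, sep = ",") %>%
        group_by(id) %>%
        # index the values within each id and convert them back to numeric
        mutate(numb = row_number(), vectors = as.numeric(vectors)) %>%
        # one column per index: wide format, one row per id
        spread(key = numb, value = vectors)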
    
    # A tibble: 10 x 26
    # Groups:   id [10]
          id   `1`   `2`   `3`   `4`    `5`   `6`    `7`   `8`     `9`  `10`  `11`  `12`   `13`   `14`  `15`   `16`
       <int> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>
     1     1  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
     2     2  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
     3     3  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
     4     4  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
     5     5  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
     6     6  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
     7     7  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
     8     8  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
     9     9  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
    10    10  1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579  1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
    # ... with 9 more variables: `17` <dbl>, `18` <dbl>, `19` <dbl>, `20` <dbl>, `21` <dbl>, `22` <dbl>, `23` <dbl>,
    #   `24` <dbl>, `25` <dbl>
    

    I find it's easiest to spread data by first getting it into a long format, which we do here with separate_rows. The problem is that separate_rows works on character columns, so we first need to turn each numeric vector in the vectors list column into a single string. We do that using paste with collapse="," inside sapply, so that each vector is collapsed separately (otherwise all the vectors would be pasted together).

    Once we have that, it's just a matter of grouping by id, adding a row-index column (and converting the values back to numeric), and spreading to get one column per vector element.
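The result above still has one row per document, while the question asks for one column per document. As a rough sketch of a more direct route, the list column can be bound straight into a numeric matrix in base R, which also avoids the round trip through strings. The names docs, m and cosine_matrix are made up for illustration, and the toy data draws a separate vector for each row. The question's tibble recycles a single rnorm(25) draw across all ten rows, which is why every row in the printed output is identical.

```r
# Toy data in the same shape as the question: one numeric vector per document.
# Base R only; `docs`, `m` and `cosine_matrix` are illustrative names.
set.seed(1)
docs <- data.frame(id = 1:10)
docs$vectors <- lapply(1:10, function(i) rnorm(25))  # a distinct vector per row

# Bind the list elements column-wise:
# 25 rows (vector dimensions) x 10 columns (documents).
m <- do.call(cbind, docs$vectors)
colnames(m) <- docs$id

# Cosine similarity between every pair of columns:
# dot products divided by the product of the column norms.
cosine_matrix <- function(x) {
  norms <- sqrt(colSums(x^2))
  crossprod(x) / outer(norms, norms)
}

sim <- cosine_matrix(m)
dim(sim)
```

Here sim is a symmetric 10 x 10 matrix with ones on the diagonal. If the lsa package is available, lsa::cosine(m) should be usable in place of the hand-written helper, since it compares the columns of a matrix, but that has not been checked against this data.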