I have a data set that looks like this:
library(tidyverse)
data <- tibble(id = 1:10,
vectors = list(rnorm(25)))
# A tibble: 10 x 2
id vectors
<int> <list>
1 1 <dbl [25]>
2 2 <dbl [25]>
3 3 <dbl [25]>
4 4 <dbl [25]>
5 5 <dbl [25]>
6 6 <dbl [25]>
7 7 <dbl [25]>
8 8 <dbl [25]>
9 9 <dbl [25]>
10 10 <dbl [25]>
I'd like to use this data set to compute cosine similarity, where each row represents a document. The cosine function from the lsa package seems like a good, easy way to do this; however, it needs each document represented as a column. I'd like to simply do data %>% t() to get my desired result, but that's not working. I've also tried "spreading" the list column first using unnest and spread, and I've tried flatten, to no avail. The first line of my desired output would look something like:
1 2 3 4 5 6 7 8 9 10
0.1 0.3 0.7 0.3 0.1 0.1 0.3 0.7 0.3 0.1
If there's a function from another package that handles data in this format, I would by all means just use that instead, though at this point I would also like to figure this out out of curiosity. I've looked at "R - list to data frame", but I'm not sure how to apply that to this situation.
The background to this is that I've performed doc2vec in Python with gensim, but due to our environment at work, if I want to build something interactive for a client it would need to be in R.
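library(dplyr)
library(tidyr)
# Collapse each numeric vector into one comma-separated string per row
mutate(data, vectors = sapply(vectors, function(x) paste(x, collapse = ","))) %>%
  # One row per value (long format)
  separate_rows(vectors, sep = ",") %>%
  group_by(id) %>%
  # Index each value within its document and convert back to numeric
  mutate(numb = row_number(), vectors = as.numeric(vectors)) %>%
  # One column per index (wide format)
  spread(key = numb, value = vectors)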
# A tibble: 10 x 26
# Groups: id [10]
id `1` `2` `3` `4` `5` `6` `7` `8` `9` `10` `11` `12` `13` `14` `15` `16`
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
2 2 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
3 3 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
4 4 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
5 5 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
6 6 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
7 7 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
8 8 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
9 9 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
10 10 1.46 0.140 0.209 -3.04 -0.487 -1.09 0.0579 1.10 -0.0256 0.515 0.990 0.303 -0.930 0.0840 0.527 0.0159
# ... with 9 more variables: `17` <dbl>, `18` <dbl>, `19` <dbl>, `20` <dbl>, `21` <dbl>, `22` <dbl>, `23` <dbl>,
# `24` <dbl>, `25` <dbl>
I find it's easiest to spread data by first gathering it into a long format. We achieve that using separate_rows. The problem is that we first need to transform the list elements of the vectors column into something separate_rows can work with. We do that using paste with collapse="," inside sapply, so that each vector is collapsed separately (otherwise all the vectors would be pasted together).
Once we have that, it's just a matter of grouping by id, adding a row-index column (and converting the strings back to numeric), and spreading to achieve the desired format.
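If the end goal is only a matrix with one document per column, a shorter route may be to bind the list elements together directly, without going through strings. The following is a minimal base-R sketch, not part of the approach above. The names ids, vectors, doc_matrix and cos_sim are illustrative, and the random vectors stand in for the real doc2vec output; with the question's tibble, do.call(cbind, data$vectors) should play the same role.

```r
set.seed(42)

# Illustrative stand-in for the question's data: 10 documents, 25 dimensions each.
ids <- 1:10
vectors <- lapply(ids, function(i) rnorm(25))

# Bind each document vector as a column: 25 rows (dimensions) x 10 columns (documents).
doc_matrix <- do.call(cbind, vectors)
colnames(doc_matrix) <- ids

# Cosine similarity between every pair of columns, computed by hand:
# dot products divided by the product of the column norms.
norms <- sqrt(colSums(doc_matrix^2))
cos_sim <- crossprod(doc_matrix) / outer(norms, norms)

dim(doc_matrix)
round(cos_sim[1:3, 1:3], 3)
```

Because doc_matrix already has documents as columns, it should also be in the shape the question describes for the cosine function from lsa, so passing it there is an option instead of the manual calculation.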