I have a document-feature matrix (DFM). I want to convert it into an LSA object and then compute the cosine similarity between each pair of documents.
These are the steps I followed:
lsa_t2 <- convert(DFM_tfidf, to = "lsa" , omit_empty = TRUE)
t2_lsa_tfidf_cos_sim = sim2(x = lsa_t2, method = "cosine", norm = "l2")
But I get this error:
Error in sim2(x = lsa_t2, method = "cosine", norm = "l2") :
inherits(x, "matrix") || inherits(x, "Matrix") is not TRUE
To give more context, this is what lsa_t2 looks like:
All of the documents contain text (I have already checked), and I filtered out documents without text before I created the dfm.
Any idea of what happened?
The error basically means that sim2 does not work with the lsa object: it accepts only a base matrix or a Matrix-package object, and the converted object is neither. However, I'm not really sure I understand the question. Why do you want to convert the dfm to the lsa textmatrix format in the first place?
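Here is a minimal base-R sketch of why that check fails. It assumes the converted object is essentially a terms-by-documents matrix whose class attribute has been replaced by "textmatrix" (that is my understanding of the lsa format, not something I verified against your data). The values and the names m, tm and plain are made up for illustration:

```r
# Hypothetical stand-in for lsa_t2: terms in rows, documents in columns
m <- matrix(c(1, 0, 2,
              1, 3, 0),
            nrow = 3,
            dimnames = list(c("gold", "silver", "truck"), c("d1", "d2")))
tm <- m
class(tm) <- "textmatrix"   # assumed class of the converted object

inherits(m, "matrix")       # TRUE: a plain matrix passes the check
inherits(tm, "matrix")      # FALSE: the custom class hides "matrix"

# Dropping the class attribute and transposing gives back a plain
# documents-by-terms matrix that passes the same check
plain <- t(unclass(tm))
inherits(plain, "matrix")   # TRUE
```

If that assumption holds for lsa_t2, passing t(unclass(lsa_t2)) to sim2 should get past the check, although skipping the conversion altogether is simpler.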
If you want to calculate cosine similarity between texts, you can do this directly in quanteda:
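library(quanteda)
library(quanteda.textstats)  # textstat_simil() lives here since quanteda 3.0
texts <- c(d1 = "Shipment of gold damaged in a fire",
           d2 = "Delivery of silver arrived in a silver truck",
           d3 = "Shipment of gold arrived in a truck")
# Tokenise explicitly: recent quanteda versions no longer build a dfm from raw text
texts_dfm <- dfm(tokens(texts))
textstat_simil(texts_dfm,
               margin = "documents",
               method = "cosine")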
#> textstat_simil object; method = "cosine"
#>       d1    d2    d3
#> d1 1.000 0.359 0.714
#> d2 0.359 1.000 0.598
#> d3 0.714 0.598 1.000
If you want to use sim2 from text2vec, you can do so using the same object without converting it first:
library(text2vec)
sim2(x = texts_dfm, method = "cosine", norm = "l2")
#> 3 x 3 sparse Matrix of class "dsCMatrix"
#>           d1        d2        d3
#> d1 1.0000000 0.3585686 0.7142857
#> d2 0.3585686 1.0000000 0.5976143
#> d3 0.7142857 0.5976143 1.0000000
As you can see, the results are the same.
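To see where these numbers come from, here is a rough base-R sketch that computes the same cosine similarities by hand. It assumes a naive whitespace tokenisation with lowercasing, which happens to give the same counts as quanteda for these toy sentences but is not how quanteda tokenises in general. The names toks, vocab, counts and unit are only illustrative:

```r
texts <- c(d1 = "Shipment of gold damaged in a fire",
           d2 = "Delivery of silver arrived in a silver truck",
           d3 = "Shipment of gold arrived in a truck")

# Naive tokenisation: lowercase, then split on single spaces (an assumption)
toks <- strsplit(tolower(texts), " ", fixed = TRUE)
vocab <- sort(unique(unlist(toks)))

# Documents-by-terms count matrix
counts <- t(vapply(toks,
                   function(tk) as.numeric(table(factor(tk, levels = vocab))),
                   numeric(length(vocab))))
dimnames(counts) <- list(names(texts), vocab)

# L2-normalise each row; the cross-product of unit vectors is the cosine
unit <- counts / sqrt(rowSums(counts^2))
cos_sim <- unit %*% t(unit)
round(cos_sim, 3)
```

For example, d1 and d3 share five of their seven distinct words, so their cosine is 5 / (sqrt(7) * sqrt(7)) = 5/7, which is the 0.714 shown above.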
Following the comments, I now understand that you want to transform your data with latent semantic analysis (LSA). You can follow the tutorial linked below and plug in the dfm in place of the dtm used there. Note that the output shown here includes a fourth document, d4, which is not part of the texts vector defined above:
library(text2vec)
texts_dfm_tfidf <- dfm_tfidf(texts_dfm)
lsa <- LSA$new(n_topics = 2)
# I get a warning here, probably due to the size of the toy dfm
dtm_tfidf_lsa <- fit_transform(texts_dfm_tfidf, lsa)
d1_d2_tfidf_cos_sim <- sim2(x = dtm_tfidf_lsa, method = "cosine", norm = "l2")
d1_d2_tfidf_cos_sim
#>              d1           d2        d3           d4
#> d1  1.000000000 -0.002533794 0.5452992  0.999996189
#> d2 -0.002533794  1.000000000 0.8368571 -0.005294431
#> d3  0.545299245  0.836857086 1.0000000  0.542983071
#> d4  0.999996189 -0.005294431 0.5429831  1.000000000
Note that these results differ from run to run unless you use set.seed().
Or if you want to do everything in quanteda:
library(quanteda.textmodels)  # textmodel_lsa() is provided by this package
texts_lsa <- textmodel_lsa(texts_dfm_tfidf, nd = 2)
textstat_simil(as.dfm(texts_lsa$docs),
               margin = "documents",
               method = "cosine")
#> textstat_simil object; method = "cosine"
#>          d1       d2    d3       d4
#> d1  1.00000 -0.00684 0.648  1.00000
#> d2 -0.00684  1.00000 0.757 -0.00894
#> d3  0.64799  0.75720 1.000  0.64638
#> d4  1.00000 -0.00894 0.646  1.00000
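For intuition about what both LSA routes are doing, here is a small sketch using base R's svd() instead of either package. It assumes LSA is treated as a truncated SVD of the documents-by-terms matrix and scales the document coordinates by the singular values. That scaling is only one convention, and I have not checked which one text2vec or quanteda uses, so do not expect it to reproduce the numbers above. The toy counts and the helpers row_cosine and lsa_docs are hypothetical:

```r
# Toy documents-by-terms matrix (made-up counts, for illustration only)
x <- matrix(c(2, 1, 0, 0,
              0, 1, 3, 1,
              1, 1, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"), c("t1", "t2", "t3", "t4")))

# Cosine similarity between the rows of a dense matrix
row_cosine <- function(m) {
  unit <- m / sqrt(rowSums(m^2))
  unit %*% t(unit)
}

# Document coordinates in the first k singular directions (U_k %*% D_k)
lsa_docs <- function(m, k) {
  s <- svd(m)
  idx <- seq_len(k)
  docs <- s$u[, idx, drop = FALSE] %*% diag(s$d[idx], nrow = k)
  rownames(docs) <- rownames(m)
  docs
}

sim_k2 <- row_cosine(lsa_docs(x, 2))
round(sim_k2, 3)
```

With this scaling, keeping all three components gives back the plain cosine similarities of x, so reducing k is what changes the picture. Because svd() is deterministic, this version needs no set.seed().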