
R - convert DFM to LSA then compute cosine similarity: Error inherits(x, "Matrix") is not TRUE


I have a document-feature matrix (DFM). I want to convert it into an LSA object and then compute the cosine similarity between each pair of documents.

These are the steps I followed:

lsa_t2 <- convert(DFM_tfidf, to = "lsa" , omit_empty = TRUE)
t2_lsa_tfidf_cos_sim = sim2(x = lsa_t2, method = "cosine", norm = "l2")

But I get this error:

Error in sim2(x = lsa_t2, method = "cosine", norm = "l2") :
inherits(x, "matrix") || inherits(x, "Matrix") is not TRUE

To give more context, this is what lsa_t2 looks like:

[Screenshot: what lsa_t2 looks like]

All of the documents contain text (I have already checked), and I filtered out documents without text before I created the dfm.

Any idea of what happened?


Solution

  • The error means that sim2 only accepts input that inherits from "matrix" or "Matrix", and the object returned by convert(..., to = "lsa") does not pass that check. However, I'm not sure I understand the question: why do you want to convert the dfm to the lsa textmatrix format in the first place?

    If you want to calculate cosine similarity between texts, you can do this directly in quanteda:
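To see why a check like this can fail on something that looks like a matrix, here is a small base R sketch. It assumes (the question does not confirm this) that the converted object is an ordinary matrix carrying a custom S3 class such as "textmatrix"; the object `m` below is a hypothetical stand-in, not real quanteda output.

```r
# Hypothetical stand-in for the converted object (an assumption, not quanteda output)
m <- matrix(1:4, nrow = 2, dimnames = list(c("t1", "t2"), c("d1", "d2")))
class(m) <- "textmatrix"    # a custom S3 class replaces the implicit "matrix" class

is.matrix(m)                # TRUE: the dim attribute is still there
inherits(m, "matrix")       # FALSE: the kind of check the error message reports

plain <- unclass(m)         # drop the custom class attribute
inherits(plain, "matrix")   # TRUE again
```

If that assumption holds for lsa_t2, unclass() would get it past the check, although passing the dfm itself is the simpler route.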

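    library(quanteda)
    library(quanteda.textstats)  # textstat_simil() lives here as of quanteda v3
    
    texts <- c(d1 = "Shipment of gold damaged in a fire",
               d2 = "Delivery of silver arrived in a silver truck",
               d3 = "Shipment of gold arrived in a truck" )
    
    # tokenise first: recent quanteda versions no longer accept character input in dfm()
    texts_dfm <- dfm(tokens(texts))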
    
    textstat_simil(texts_dfm, 
                   margin = "documents",
                   method = "cosine")
    #> textstat_simil object; method = "cosine"
    #>       d1    d2    d3
    #> d1 1.000 0.359 0.714
    #> d2 0.359 1.000 0.598
    #> d3 0.714 0.598 1.000
    

    If you want to use sim2 from text2vec, you can do so using the same object without converting it first:

    library(text2vec)
    sim2(x = texts_dfm, method = "cosine", norm = "l2")
    #> 3 x 3 sparse Matrix of class "dsCMatrix"
    #>           d1        d2        d3
    #> d1 1.0000000 0.3585686 0.7142857
    #> d2 0.3585686 1.0000000 0.5976143
    #> d3 0.7142857 0.5976143 1.0000000
    

    As you can see, the results are the same.
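As a cross-check on both outputs, here is a base R sketch that computes the same cosine similarities by hand. Splitting on single spaces stands in for quanteda's tokeniser, a simplification that is only adequate because these three sentences contain no punctuation; the variable names are illustrative.

```r
texts <- c(d1 = "Shipment of gold damaged in a fire",
           d2 = "Delivery of silver arrived in a silver truck",
           d3 = "Shipment of gold arrived in a truck")

# Lowercase and split on spaces (a simplified stand-in for a real tokeniser)
tokens <- strsplit(tolower(texts), " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Document-term count matrix: one row per document, one column per term
counts <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

# L2-normalise each row; cosine similarity is then the row cross-product
normed  <- counts / sqrt(rowSums(counts^2))
cos_sim <- tcrossprod(normed)
round(cos_sim, 3)
```

The off-diagonal values are 3/sqrt(70), 5/7 and 5/sqrt(70), i.e. roughly 0.359, 0.714 and 0.598, which agrees with both outputs above.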

    Update

    Following the comments, I now understand that you want to transform your data with latent semantic analysis (LSA). You can follow the tutorial linked below and plug in the dfm instead of the dtm used in the tutorial:

    texts_dfm_tfidf <- dfm_tfidf(texts_dfm)
    
    
    library(text2vec)
    lsa = LSA$new(n_topics = 2)
    dtm_tfidf_lsa = fit_transform(texts_dfm_tfidf, lsa) # I get a warning here, probably due to the size of the toy dfm
    d1_d2_tfidf_cos_sim = sim2(x = dtm_tfidf_lsa, method = "cosine", norm = "l2")
    d1_d2_tfidf_cos_sim
    #>              d1           d2        d3           d4
    #> d1  1.000000000 -0.002533794 0.5452992  0.999996189
    #> d2 -0.002533794  1.000000000 0.8368571 -0.005294431
    #> d3  0.545299245  0.836857086 1.0000000  0.542983071
    #> d4  0.999996189 -0.005294431 0.5429831  1.000000000
    

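    Note that these results differ from run to run unless you use set.seed(). (The output here also includes a fourth document, d4, that is not part of the three-text example above.)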

    Or if you want to do everything in quanteda:

    library(quanteda.textmodels)  # textmodel_lsa() lives here as of quanteda v2
    texts_lsa <- textmodel_lsa(texts_dfm_tfidf, nd = 2)
    
    textstat_simil(as.dfm(texts_lsa$docs), 
                   margin = "documents",
                   method = "cosine")
    #> textstat_simil object; method = "cosine"
    #>          d1       d2    d3       d4
    #> d1  1.00000 -0.00684 0.648  1.00000
    #> d2 -0.00684  1.00000 0.757 -0.00894
    #> d3  0.64799  0.75720 1.000  0.64638
    #> d4  1.00000 -0.00894 0.646  1.00000
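
For intuition about what the LSA step is doing, here is a base R sketch that uses a truncated SVD directly. This is not the implementation used by text2vec or quanteda, and it works on raw counts rather than tf-idf weights to stay short, so the numbers are not expected to match the outputs above. All names (`counts`, `doc_coords`, `lsa_cos_sim`) are illustrative.

```r
texts <- c(d1 = "Shipment of gold damaged in a fire",
           d2 = "Delivery of silver arrived in a silver truck",
           d3 = "Shipment of gold arrived in a truck")

# Simplified tokenisation: lowercase and split on spaces
tokens <- strsplit(tolower(texts), " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Document-term counts (documents in rows), stored as doubles for svd()
counts <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
storage.mode(counts) <- "double"

# Truncated SVD: keep the first k singular vectors and values
k   <- 2
dec <- svd(counts)
doc_coords <- dec$u[, 1:k, drop = FALSE] %*% diag(dec$d[1:k], nrow = k)
rownames(doc_coords) <- rownames(counts)

# Cosine similarity between documents in the reduced k-dimensional space
normed      <- doc_coords / sqrt(rowSums(doc_coords^2))
lsa_cos_sim <- tcrossprod(normed)
round(lsa_cos_sim, 3)
```

Because svd() is deterministic, this sketch gives the same result on every run, unlike the randomised fit mentioned above.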