DocumentTermMatrix with Sparsity 0%


I'm trying to obtain a document-term matrix from a book in Italian. I have the PDF file of this book and I wrote a few lines of code:

#install.packages("pdftools")
library(pdftools)
library(tm)
text <- pdf_text("IoRobot.pdf")
# collapse all PDF pages into a single string
text <- paste(unlist(text), collapse = "")
myCorpus <- VCorpus(VectorSource(text))
mydtm <- DocumentTermMatrix(myCorpus, control = list(removeNumbers = TRUE, removePunctuation = TRUE,
                                 stopwords = stopwords("it"), stemming = TRUE))
inspect(mydtm)

The output of the last line is:

<<DocumentTermMatrix (documents: 1, terms: 10197)>>
Non-/sparse entries: 10197/0
Sparsity           : 0%
Maximal term length: 39
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs calvin cosa donovan esser piú poi powel prima quando robot
   1    201  191     254   193 288 211   287   166    184   62

I noticed that the sparsity is 0%. Is this normal?


Solution

  • Yes, it is correct.
    A document-term matrix has one row per document and one column per term. With the term frequency (tf) weighting shown in your output, each cell holds the number of times the term occurs in that document, and 0 if it does not occur.
    Sparsity is the share of cells in the document-term matrix that are 0.
    An entry is sparse when its term does not occur in the corresponding document.
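
    To see that the cells are counts rather than 0/1 flags, here is a minimal sketch with a repeated word. The input string and the variable names are made up for illustration; the calls are the same tm functions used elsewhere in this answer, with default settings.

```r
library(tm)

# Illustrative one-document corpus in which "text" occurs twice.
corpus <- VCorpus(VectorSource("text text more"))
dtm <- DocumentTermMatrix(corpus)

# Dense view of the matrix: one row (the document), one column per term.
m <- as.matrix(dtm)
m
```

    The cell for "text" is 2, not 1, while the cell for "more" is 1. Sparsity only cares whether a cell is zero or not, so the size of the count does not affect it.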

    To make this concrete, let's look at a reproducible example that creates a situation similar to yours:

    library(tm)
    text <- c("here some text")
    corpus <- VCorpus(VectorSource(text))
    DTM <- DocumentTermMatrix(corpus)
    DTM
    
    <<DocumentTermMatrix (documents: 1, terms: 3)>>
    Non-/sparse entries: 3/0
    Sparsity           : 0%
    Maximal term length: 4
    Weighting          : term frequency (tf)
    

    Looking at the output, we can see that you have one document, so a DTM built from that corpus has a single row.
    Having a look at it:

    as.matrix(DTM)
        Terms
    Docs here some text
       1    1    1    1
    

    Now the output should be easier to read:

    • You have one doc with three terms:

      <<DocumentTermMatrix (documents: 1, terms: 3)>>

    • Your non-sparse entries (i.e. != 0 in the DTM) are 3, and your sparse entries (== 0) are 0:

      Non-/sparse entries: 3/0

    So your sparsity is 0%, because a one-document corpus cannot contain any 0s: every term comes from that single document, so every entry is at least 1:

      Sparsity           : 0%
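
    Applied to the question's code, this means sparsity can only become non-zero if the corpus holds more than one document. A hedged sketch, assuming that treating each PDF page as its own document suits the analysis: pdf_text() returns one string per page, so skipping the paste(..., collapse = "") step keeps the pages separate. The two strings below are made-up stand-ins for real pages, so that the snippet runs without the PDF.

```r
library(tm)

# With the real book, the pages would come from pdftools (one string per page):
# pages <- pdftools::pdf_text("IoRobot.pdf")
# Hypothetical stand-in pages, used here only so the sketch is runnable:
pages <- c("il robot parla", "calvin osserva il robot")

# One document per page instead of one document for the whole book.
page_corpus <- VCorpus(VectorSource(pages))
page_dtm <- DocumentTermMatrix(page_corpus)

# Count the zero cells: terms that are missing from some page.
n_zero <- sum(as.matrix(page_dtm) == 0)
page_dtm
```

    With more than one document, any term that is missing from some page produces a 0 cell, so the reported sparsity is no longer forced to 0%. Note that with default settings tm drops words shorter than three characters, so "il" does not appear as a term here.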
    

    Now a different example, one that does have sparse entries:

    text <- c("here some text", "other text")
    
    corpus <- VCorpus(VectorSource(text))
    DTM <- DocumentTermMatrix(corpus)
    
    DTM
    <<DocumentTermMatrix (documents: 2, terms: 4)>>
    Non-/sparse entries: 5/3
    Sparsity           : 38%
    Maximal term length: 5
    Weighting          : term frequency (tf)
    
    as.matrix(DTM)
        Terms
    Docs here other some text
       1    1     0    1    1
       2    0     1    0    1
    

    Now you have 3 sparse entries and 5 non-sparse ones (the 5/3 in the output). The matrix has 2 x 4 = 8 cells, so the sparsity is 3/8 = 0.375, which is reported as 38%.
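
    The same arithmetic can be checked by hand. The sketch below uses base R only and does not call tm: it retypes the 2 x 4 matrix printed above (the variable names are illustrative) and counts the zero cells.

```r
# Retype the count matrix printed by as.matrix(DTM) above.
m <- matrix(c(1, 0, 1, 1,
              0, 1, 0, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(Docs = c("1", "2"),
                            Terms = c("here", "other", "some", "text")))

n_nonzero <- sum(m != 0)          # cells where the term occurs in the document
n_zero    <- sum(m == 0)          # cells where it does not
sparsity  <- n_zero / length(m)   # zero cells divided by all cells

cat("Non-/sparse entries:", n_nonzero, "/", n_zero, "\n")
cat("Sparsity:", sparsity, "\n")
```

    This gives 5 non-zero cells, 3 zero cells, and a sparsity of 0.375, matching the 5/3 and 38% in the DTM summary.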