
Single digits/letters do not appear as terms after creating a TermDocumentMatrix


I built a TermDocumentMatrix in R from documents (strings) that include single-letter words. The resulting terms do not include those single-letter words. Which control option should I pass so that single-letter words are included in my term-document matrix?


Solution

  • By default, the minimum word length is 3, so shorter terms are dropped. To override the default, pass wordLengths in the control list, as in the following code.

    library(tm)
    docs <- c("This is a text","When Will u start", "1 12 123")
    corpus <- Corpus(VectorSource(docs))
    
    as.matrix(DocumentTermMatrix(corpus)) # words with length < 3 ('a', 'is', 'u', '1', '12') are excluded
    #    Terms
    #Docs 123 start text this when will
    #   1   0     0    1    1    0    0
    #   2   0     1    0    0    1    1
    #   3   1     0    0    0    0    0
    
    as.matrix(DocumentTermMatrix(corpus, control = list(wordLengths=c(1,Inf))))
    #    Terms
    #Docs 1 12 123 a is start text this u when will
    #   1 0  0   0 1  1     0    1    1 0    0    0
    #   2 0  0   0 0  0     1    0    0 1    1    1
    #   3 1  1   1 0  0     0    0    0 0    0    0
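
The question asks about TermDocumentMatrix rather than DocumentTermMatrix. A minimal sketch, assuming the same control list is accepted there (the two functions in tm share their control options, and the result is simply the transpose, with terms as rows):

```r
library(tm)

# Same example documents as above
docs <- c("This is a text", "When Will u start", "1 12 123")
corpus <- Corpus(VectorSource(docs))

# wordLengths = c(1, Inf) keeps every term of length 1 or more
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# Check that the short terms are now present
short_terms <- c("a", "u", "1", "12")
short_terms %in% Terms(tdm)
```

Calling as.matrix(tdm) shows the full matrix, with one row per term and one column per document.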