r tm term-document-matrix

Does tm automatically ignore the very short strings?


Here is my code. Example 1:

library(tm)  # stemming = TRUE also relies on the SnowballC package

a <- c("ab cd de", "ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))
a2 <- TermDocumentMatrix(a1, control = list(stemming = TRUE))
inspect(a2)

The result is:

         Docs
Terms     1 2
  12v     0 1
  a23     0 1
  alkalin 0 1
  batteri 0 1
  energ   0 1

It looks like the first string in a is ignored entirely.

Example 2:

a <- c("abcd cde de", "ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))
a2 <- TermDocumentMatrix(a1, control = list(stemming = TRUE))
inspect(a2)

The result is:

         Docs
Terms     1 2
  12v     0 1
  a23     0 1
  abcd    1 0
  alkalin 0 1
  batteri 0 1
  cde     1 0
  energ   0 1

We can see that two substrings (abcd, cde) are kept, while the shortest one (de) is still missing. The result is the same if I do not use control = list(stemming=T). So I am curious whether this is by design in tm: are strings ignored if they are shorter than 3 characters? I do not think this is a good idea, since a short string, such as an abbreviation, can still be useful.

If so, is there a parameter or something that can change this? Thanks a lot.


Solution

  • See ?termFreq. The option you have to set is wordLengths. From the doc:

    An integer vector of length 2. Words shorter than the minimum word length ‘wordLengths[1]’ or longer than the maximum word length ‘wordLengths[2]’ are discarded. Defaults to ‘c(3, Inf)’, i.e., a minimum word length of 3 characters.

    So, if you don't want to exclude short words, set the minimum word length to 1:

    a2 <- TermDocumentMatrix(a1, control = list(stemming = TRUE, wordLengths = c(1, Inf)))
    inspect(a2)
             Docs
    Terms     1 2
      12v     0 1
      a23     0 1
      ab      1 0
      alkalin 0 1
      batteri 0 1
      cd      1 0
      de      1 0
      energ   0 1
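
To see why the first document came out empty in example 1, here is a rough base-R sketch of a length filter that behaves like the wordLengths rule quoted above. It is only an illustration, not tm's actual implementation: filter_by_length is a hypothetical helper, and splitting on single spaces is a simplification of tm's tokenizer.

```r
# Hypothetical helper (not part of tm): keep only the tokens whose
# character count lies within word_lengths[1]..word_lengths[2].
filter_by_length <- function(tokens, word_lengths = c(3, Inf)) {
  n <- nchar(tokens)
  tokens[n >= word_lengths[1] & n <= word_lengths[2]]
}

# Simplified tokenization: lower-case, then split on single spaces.
tokens <- strsplit(tolower("ab cd de"), " ", fixed = TRUE)[[1]]

filter_by_length(tokens)             # minimum of 3: every token is dropped
filter_by_length(tokens, c(1, Inf))  # minimum of 1: all three tokens are kept
```

With a minimum of 3, all of "ab", "cd" and "de" fall below the bound, which matches the empty first column in example 1. Lowering the bound to 1 keeps them, as in the inspect() output above.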