Here is my code. Example 1:
library(tm)  # stemming also requires the SnowballC package to be installed
a <- c("ab cd de", "ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))
a2 <- TermDocumentMatrix(a1, control = list(stemming = TRUE))
The result is:
Terms   1 2
12v     0 1
a23     0 1
alkalin 0 1
batteri 0 1
energ   0 1
It looks like the first string in a is ignored.
Example 2:
a <- c("abcd cde de", "ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))
a2 <- TermDocumentMatrix(a1, control = list(stemming = TRUE))
The result is:
Terms   1 2
12v     0 1
a23     0 1
abcd    1 0
alkalin 0 1
batteri 0 1
cde     1 0
energ   0 1
We can see that two substrings (abcd, cde) are kept, while the shortest one (de) is still missing. The result is the same if I omit the stemming option. So is this by design in tm? Are strings shorter than 3 characters ignored? I do not think this is a good idea: a short string, such as an abbreviation, may well be useful.
If so, is there a parameter or something that can change this? Thanks a lot.
See ?termFreq. The option you have to set is wordLengths. From the doc:
An integer vector of length 2. Words shorter than the minimum word length ‘wordLengths[1]’ or longer than the maximum word length ‘wordLengths[2]’ are discarded. Defaults to ‘c(3, Inf)’, i.e., a minimum word length of 3 characters.
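To make the rule concrete, here is a minimal base-R sketch of the length filter the documentation describes. Note that filter_by_length is a hypothetical helper written for illustration, not a function from tm, and tm's actual tokenization and filtering code may differ in detail.

```r
# Hypothetical helper (not part of tm): mimics the documented wordLengths rule.
# Words shorter than wordLengths[1] or longer than wordLengths[2] are dropped.
filter_by_length <- function(words, wordLengths = c(3, Inf)) {
  n <- nchar(words)
  words[n >= wordLengths[1] & n <= wordLengths[2]]
}

tokens <- c("ab", "cd", "de", "abcd", "cde")

# With the default minimum of 3, only "abcd" and "cde" survive.
kept_default <- filter_by_length(tokens)

# With a minimum of 1, every token is kept.
kept_all <- filter_by_length(tokens, wordLengths = c(1, Inf))

print(kept_default)
print(kept_all)
```

This matches what the question observed: "abcd" and "cde" pass the default filter, while "ab", "cd" and "de" do not.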
So, if you don't want to exclude short words, you can set the minimum word length to 1:
a2 <- TermDocumentMatrix(a1, control = list(stemming = TRUE, wordLengths = c(1, Inf)))
Terms   1 2
12v     0 1
a23     0 1
ab      1 0
alkalin 0 1
batteri 0 1
cd      1 0
de      1 0
energ   0 1