I'm just getting to grips with the tm package in R.
This is probably a simple question, but I'm trying to use the findAssocs function to get an idea of the word associations in my customer enquiries insight document, and I can't seem to get findAssocs to work correctly.
When I use the following:
findAssocs(dtm, words, corlimit = 0.30)
$population
numeric(0)
$migration
numeric(0)
What does this mean? words is a character vector of 667 words, so surely there must be some correlations among them?
Consider the following example:
library(tm)
corp <- VCorpus(VectorSource(
c("hello world", "hello another World ", "and hello yet another world")))
tdm <- TermDocumentMatrix(corp)
inspect(tdm)
#          Docs
# Terms     1 2 3
#   and     0 0 1
#   another 0 1 1
#   hello   1 1 1
#   world   1 1 1
#   yet     0 0 1
Now consider
findAssocs(x=tdm, terms=c("hello", "yet"), corlimit=.4)
# $hello
# numeric(0)
#
# $yet
#     and another
#     1.0     0.5
From what I understand, findAssocs looks at the correlations of "hello" with every term except "hello" and "yet", and of "yet" with every term except "hello" and "yet". In other words, the terms you pass in are themselves excluded from the candidates. "yet" and "and" have a correlation coefficient of 1.0, which is above the lower limit of 0.4. "yet" and "another" have a correlation of 0.5 ("yet" occurs in one of the two documents that contain "another"), which is also above our 0.4 limit.
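To check those two numbers by hand, here is a minimal sketch in base R. It assumes findAssocs computes plain Pearson correlations between the rows of the term-document matrix, which is consistent with the output shown above. The matrix m is typed in from the inspect(tdm) output rather than taken from tm (with tm loaded, as.matrix(tdm) should give the same counts), and the variable names are only illustrative.

```r
# Term-by-document counts copied from the inspect(tdm) output
m <- matrix(
  c(0, 0, 1,   # and
    0, 1, 1,   # another
    1, 1, 1,   # hello
    1, 1, 1,   # world
    0, 0, 1),  # yet
  nrow = 5, byrow = TRUE,
  dimnames = list(Terms = c("and", "another", "hello", "world", "yet"),
                  Docs  = c("1", "2", "3"))
)

# Pearson correlation between the per-document counts of two terms
cor_yet_and     <- cor(m["yet", ], m["and", ])
cor_yet_another <- cor(m["yet", ], m["another", ])

cor_yet_and      # 1
cor_yet_another  # 0.5
```

Both values are at or above corlimit = 0.4, which is why both terms show up under $yet.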
Here's another example showcasing this:
findAssocs(x=tdm, terms=c("yet", "another"), corlimit=0)
# $yet
# and
# 1
#
# $another
# and
# 0.5
Note that hello (and world) don't yield any results because they are in every document. This means their term frequency has zero variance, and cor under the hood yields NA (like cor(rep(1,3), 1:3), which gives NA plus a zero-standard-deviation warning).
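If you want to see which terms are affected by this, one option is to look for terms whose counts do not vary across documents. The following is only a sketch in base R on the toy counts from the example; m, term_sd and constant_terms are illustrative names, and with a real corpus you would start from as.matrix() of your own matrix instead (which may be large).

```r
# Same term-by-document counts as in the inspect(tdm) output
m <- rbind(
  and     = c(0, 0, 1),
  another = c(0, 1, 1),
  hello   = c(1, 1, 1),
  world   = c(1, 1, 1),
  yet     = c(0, 0, 1)
)

# Standard deviation of each term's counts across the documents
term_sd <- apply(m, 1, sd)

# Terms with zero spread cannot be correlated with anything
constant_terms <- names(term_sd)[term_sd == 0]
constant_terms  # "hello" "world"

# cor() gives NA for such a term; the warning is suppressed here
hello_vs_yet <- suppressWarnings(cor(m["hello", ], m["yet", ]))
hello_vs_yet    # NA
```

Any term that ends up in constant_terms will come back as numeric(0) from findAssocs, whatever corlimit you choose.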