Word association: findAssocs returns numeric(0)


I'm just getting to grips with the tm package in R.

This is probably a simple question. I'm trying to use the findAssocs function to get an idea of the word associations in my customer enquiries insight document, but I can't seem to get it to work correctly.

When I use the following:

    findAssocs(dtm, words, corlimit = 0.30)
    # $population
    # numeric(0)
    #
    # $migration
    # numeric(0)

What does this mean? words is a character vector of 667 words, so surely there must be some correlations among them?


Solution

  • Consider the following example:

    library(tm)
    corp <- VCorpus(VectorSource(
              c("hello world", "hello another World ", "and hello yet another world")))
    tdm <- TermDocumentMatrix(corp)
    inspect(tdm)
    #          Docs
    # Terms     1 2 3
    #   and     0 0 1
    #   another 0 1 1
    #   hello   1 1 1
    #   world   1 1 1
    #   yet     0 0 1
    

    Now consider

    findAssocs(x=tdm, terms=c("hello", "yet"), corlimit=.4)
    # $hello
    # numeric(0)
    # 
    # $yet
    #     and another 
    #     1.0     0.5 
    

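    From what I understand, findAssocs correlates each queried term with every term that is not itself in the terms vector: hello with everything except hello and yet, and yet with everything except hello and yet. yet and and have a correlation coefficient of 1.0, which is above the lower limit of 0.4. yet and another have a correlation coefficient of 0.5, which is also above the 0.4 limit. Note that 0.5 is a Pearson correlation of the term counts across documents; it only happens to coincide with yet appearing in 50% of the documents that contain another.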
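
    As a sanity check, the two values reported for yet can be reproduced with base R's cor(), which suggests they are plain Pearson correlations of the per-document term counts. This is only a sketch: the matrix m is typed in by hand from the inspect(tdm) output above rather than extracted from tdm, and the names m, r_and and r_another are illustrative.

    ```r
    # Term counts per document, copied by hand from the inspect(tdm) output.
    # Rows are terms, columns are documents 1 to 3.
    m <- rbind(
      and     = c(0, 0, 1),
      another = c(0, 1, 1),
      hello   = c(1, 1, 1),
      world   = c(1, 1, 1),
      yet     = c(0, 0, 1)
    )

    # Pearson correlation of "yet" with the two non-constant, non-queried terms
    r_and     <- cor(m["yet", ], m["and", ])
    r_another <- cor(m["yet", ], m["another", ])

    r_and      # 1
    r_another  # 0.5
    ```

    Both values match the findAssocs output, and both clear the corlimit of 0.4.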

    Here's another example showcasing this:

    findAssocs(x=tdm, terms=c("yet", "another"), corlimit=0)
    # $yet
    # and 
    #   1 
    # 
    # $another
    # and 
    # 0.5 
    

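    Note that hello (and world) don't yield any results because they occur the same number of times in every document. This means their term frequency has zero variance, so cor under the hood yields NA (like cor(rep(1,3), 1:3), which gives NA plus a zero-standard-deviation warning). An NA can never reach corlimit, so the result for such a term is numeric(0). More generally, numeric(0) just means that no eligible term reached corlimit for that word.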
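
    A minimal sketch for spotting such constant terms follows. It again uses a hand-typed copy of the small matrix; with a real corpus, as.matrix(tdm) should give the equivalent matrix, provided it fits in memory. The names m, term_sd, constant_terms and r are illustrative, not part of tm.

    ```r
    # Term counts per document, copied by hand from the inspect(tdm) output.
    m <- rbind(
      and     = c(0, 0, 1),
      another = c(0, 1, 1),
      hello   = c(1, 1, 1),
      world   = c(1, 1, 1),
      yet     = c(0, 0, 1)
    )

    # Standard deviation of each term's counts across documents
    term_sd <- apply(m, 1, sd)

    # Terms whose counts never vary; cor() cannot produce a value for these
    constant_terms <- names(term_sd)[term_sd == 0]
    constant_terms  # "hello" "world"

    # Correlating a constant term with anything gives NA
    # (the zero-standard-deviation warning is suppressed here)
    r <- suppressWarnings(cor(m["hello", ], m["yet", ]))
    is.na(r)  # TRUE
    ```

    If a queried term shows up in constant_terms, numeric(0) is the expected result for it whatever corlimit is used.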