r, nlp, text-mining, tm

Word Association In R


I am searching for a solution/library or any function that finds the most frequent word associations within a paragraph. For example:

This tree gives red apple. Bananas are yellow. The apple I ate was red.

In the above text, we should be able to get the association of each word with all other words in the same sentence (after removing stop words and stemming). So let's say the above text gives these associations:

    tree - red       : 0.41
    tree - apple     : 0.46
    bananas - yellow : 0.30
    apple - red      : 0.8

The most frequent two-word combination in the text is "apple - red", since both words occur together in two sentences.

The two solutions I have tried are :

  1. findAssocs() from the tm library:

          Word AssociatedWord Association  
    1    apple            red           1
    2    apple            ate         0.5
    3    apple           tree         0.5
    4      red          apple           1
    5      red            ate         0.5
    6      red           tree         0.5
    7      ate          apple         0.5 
    8      ate            red         0.5  
    9  bananas         yellow           1  
    10    tree          apple         0.5 
    11    tree            red         0.5
    12  yellow        bananas           1
    

    The result shown above is the output for the text given above. The sentences were entered as separate documents, because the function finds no associations when the whole text is a single document.

  2. A custom solution using the most frequent n-grams: this is not feasible, since it only checks consecutively occurring words.
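
For the first approach, the sentences do not have to be entered by hand: the paragraph can be split in code and each sentence treated as its own document. The following is only a minimal sketch under some assumptions: sentences are taken to end in `.`, `!` or `?` followed by whitespace, the tm package is assumed to be installed (the tm part is skipped otherwise), and the `corlimit` of 0.4 is an arbitrary choice for illustration.

```r
text <- "This tree gives red apple. Bananas are yellow. The apple I ate was red."

# Split after sentence-ending punctuation so each sentence becomes one document.
# Assumption: sentences end in ".", "!" or "?" followed by whitespace.
sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))

# The tm part only runs if the package is available.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(sentences))

  # One row per sentence, one column per term; stop words are dropped here.
  dtm <- tm::DocumentTermMatrix(
    corpus,
    control = list(tolower = TRUE, removePunctuation = TRUE, stopwords = TRUE)
  )

  # Terms whose correlation with "apple" across sentences is at least 0.4
  # (0.4 is an arbitrary threshold chosen for this example).
  assocs <- tm::findAssocs(dtm, "apple", corlimit = 0.4)
  print(assocs)
}
```

The input text stays a single string; only the corpus is built per sentence.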

I am just looking for a solution that gives the most frequent word associations. I can't break the text into multiple lines, so is there any solution of this kind? Any help would be appreciated.


Solution

  • It is not at all clear what you want. What do you mean by frequent word associations in a single line of text? An association value requires a metric; in findAssocs() the metric is a correlation that reflects how often two words appear in the same documents.

    When a document contains something like "This tree gives red apple", the only information you have is that tree and apple occur in the same document, and perhaps that they are separated by two words, or something like that. What do you want as the metric here? Define one.
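
One way to make "define a metric" concrete is to count, for every pair of words, the number of sentences that contain both. This is just one possible choice and not necessarily what the question intends. The sketch below uses base R only; the stop-word list is a hypothetical, deliberately tiny one for this example, and stemming is left out.

```r
text <- "This tree gives red apple. Bananas are yellow. The apple I ate was red."

# Hypothetical, deliberately tiny stop-word list for this example only.
stop_words <- c("this", "the", "i", "are", "was")

# Split into sentences, then into unique lower-case words per sentence.
sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
tokens <- lapply(sentences, function(s) {
  words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", s)), "\\s+"))
  setdiff(unique(words), stop_words)
})

# Binary sentence-by-term matrix: 1 if the term occurs in the sentence.
vocab <- sort(unique(unlist(tokens)))
m <- t(sapply(tokens, function(w) as.integer(vocab %in% w)))
colnames(m) <- vocab

# cooc[a, b] = number of sentences that contain both a and b.
cooc <- crossprod(m)
diag(cooc) <- 0

# List each pair once (upper triangle), most frequent first.
idx <- which(upper.tri(cooc) & cooc > 0, arr.ind = TRUE)
pairs <- data.frame(
  word1 = vocab[idx[, "row"]],
  word2 = vocab[idx[, "col"]],
  sentences_shared = cooc[idx],
  stringsAsFactors = FALSE
)
pairs <- pairs[order(-pairs$sentences_shared), ]
print(pairs)
```

With the example text, the top row is apple - red with 2 shared sentences; every other pair shares only one. Dividing each count by the number of sentences, or by the number of sentences containing either word, would turn it into a score between 0 and 1 if that is preferred.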