Tags: r, text-mining, quanteda

Creating and computing percentage of co-occurrence based on keywords


I have the following data set:

df <- data.frame (text  = c("House Sky Blue",
                            "House Sky Green",
                            "House Sky Red",
                            "House Sky Yellow",
                            "House Sky Green",
                            "House Sky Glue",
                            "House Sky Green"))

I'd like to find the percentage of co-occurrence for certain tokens. For example, of all the documents that contain the token "House", how many also include the token "Green"?

In our data, 7 documents contain the term "House", and 3 of those 7 (p = 100 * 3/7) also include the term "Green". It would also be nice to see which other terms or tokens appear alongside the token "House" above some threshold p.
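To make the target computation concrete, the single percentage can be sketched in base R with grepl() on the raw text (this only illustrates the arithmetic; it is not a quanteda solution):

```r
df <- data.frame(text = c("House Sky Blue",
                          "House Sky Green",
                          "House Sky Red",
                          "House Sky Yellow",
                          "House Sky Green",
                          "House Sky Glue",
                          "House Sky Green"))

has_house <- grepl("\\bHouse\\b", df$text)
has_green <- grepl("\\bGreen\\b", df$text)

# share of "House" documents that also contain "Green"
sum(has_house & has_green) / sum(has_house)
#> [1] 0.4285714
```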

I have used these two functions:

textstat_collocations(tokens)

> textstat_collocations(tokens)
  collocation count count_nested length   lambda        z
1   house sky     7            0      2 5.416100 2.622058
2   sky green     3            0      2 2.456736 1.511653

And textstat_simil():

textstat_simil(dfm(tokens),margin="features")

textstat_simil object; method = "correlation"
       house sky   blue  green    red yellow   glue
house    NaN NaN    NaN    NaN    NaN    NaN    NaN
sky      NaN NaN    NaN    NaN    NaN    NaN    NaN
blue     NaN NaN  1.000 -0.354 -0.167 -0.167 -0.167
green    NaN NaN -0.354  1.000 -0.354 -0.354 -0.354
red      NaN NaN -0.167 -0.354  1.000 -0.167 -0.167
yellow   NaN NaN -0.167 -0.354 -0.167  1.000 -0.167
glue     NaN NaN -0.167 -0.354 -0.167 -0.167  1.000

but they do not seem to give my desired output. Also, I wonder why the correlation between "green" and "house" is NaN in the textstat_simil() output.

My desired output would show the following info:

feature="House"
 percentage of co-occurrence 

Green = 3/7
Blue = 1/7
Red = 1/7
Yellow = 1/7
Glue = 1/7

In the quanteda docs I can't seem to find a function that gives my desired output, although I know there must be a way, since I find this library to be so fast and complete.


Solution

  • One way to do this is to use fcm() to get document-level co-occurrences for a target feature. Below, I show how to do this using fcm(), then fcm_remove() to remove the target feature, and then a loop to produce the desired printed output.

    library("quanteda")
    #> Package version: 3.2.4
    #> Unicode version: 14.0
    #> ICU version: 70.1
    #> Parallel computing: 10 of 10 threads used.
    #> See https://quanteda.io for tutorials and examples.
    
    df <- data.frame(text = c("House Sky Blue",
                              "House Sky Green",
                              "House Sky Red",
                              "House Sky Yellow",
                              "House Sky Green",
                              "House Sky Glue",
                              "House Sky Green"))
    corp <- corpus(df)
    
    coocc_fract <- function(corp, feature) {
       # create a document-level co-occurrence matrix
       fcmat <- fcm(dfm(tokens(corp), tolower = FALSE), context = "document")
       # select for the given feature
       fcmat <- fcm_remove(fcmat, feature)
       cat("feature=\"", feature, "\"\n", sep = "")
       cat(" percentage of co-occurrence\n\n")
       for (f in featnames(fcmat)) {
           # print only features that co-occur at least once
           freq <- as.numeric(fcmat[1, f])
           if (freq > 0) {
               cat(f, " = ", freq, "/", ndoc(corp), "\n", sep = "")
           }
       }
    }
    

    This produces the following output:

    coocc_fract(corp, feature = "House")
    #> feature="House"
    #>  percentage of co-occurrence
    #> 
    #> Blue = 1/7
    #> Green = 3/7
    #> Red = 1/7
    #> Yellow = 1/7
    #> Glue = 1/7
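    An alternative, loop-free sketch (my own variation, assuming only standard dfm/Matrix indexing, not a documented quanteda recipe): keep the documents that contain the target feature, then count, for each feature, how many of those documents contain it. Note this also reports the target itself and ubiquitous features such as "Sky" (both 7/7):

```r
library("quanteda")

dfmat <- dfm(tokens(corp), tolower = FALSE)

# which documents contain the target feature "House"?
keep <- as.numeric(dfmat[, "House"]) > 0

# among those documents, how many contain each feature at least once?
counts <- colSums(dfmat[keep, ] > 0)
counts / sum(keep)
```

    Dividing by sum(keep) gives the fraction of "House" documents containing each feature, e.g. Green = 3/7.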
    

    Created on 2023-01-02 with reprex v2.0.2
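    On the NaN question from the original post: "house" and "sky" occur in every document, so their dfm columns are constant (all 1s). Pearson correlation divides by the standard deviation of each column, and a zero-variance column makes that division undefined. Base R's cor() shows the same behaviour (it returns NA with a warning, where textstat_simil() reports NaN):

```r
house <- c(1, 1, 1, 1, 1, 1, 1)  # "house" appears in all 7 documents
green <- c(0, 1, 0, 0, 1, 0, 1)  # "green" appears in 3 of them

sd(house)          # 0: the column has no variance
cor(house, green)  # NA, with a warning that the standard deviation is zero
```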