
Calculating word frequency for multi-words in R?


I'm trying to compute the frequency of multi-word phrases in a given text. For instance, consider the text: "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology". I want the number of times the phrase "environmental research" occurs in it. Here is the code I've tried.

library(tm)
#Reading the data
text = readLines(file.choose())
text1 = Corpus(VectorSource(text))

#Cleaning the data
text1 = tm_map(text1, content_transformer(tolower))
text1 = tm_map(text1, removePunctuation)
text1 = tm_map(text1, removeNumbers)
text1 = tm_map(text1, stripWhitespace)
text1 = tm_map(text1, removeWords, stopwords("english"))

#Making a document matrix
dtm = TermDocumentMatrix(text1)
m11 = as.matrix(dtm)
freq11 = sort(rowSums(m11), decreasing=TRUE)
d11 = data.frame(word=names(freq11), freq=freq11)
head(d11,9)

This code, however, produces the frequency of each word separately. Instead, how do I obtain the number of times "environmental research" occurs together in the text? Thanks!


Solution

  • If you have a list of multiwords already and you want to compute their frequency in a text, you can use str_extract_all:

    text <- "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology"
    
    library(stringr)
    str_extract_all(text, "[Ee]nvironmental [Rr]esearch")
    [[1]]
    [1] "Environmental Research" "Environmental Research" "Environmental Research"
    

    If you want to know how often the multiword occurs you can do this:

    length(unlist(str_extract_all(text, "[Ee]nvironmental [Rr]esearch")))
    [1] 3
    

    If you're interested in extracting all multiwords at once you can proceed like this:

    First define a vector with all multiwords:

    multiwords <- c("[Ee]nvironmental [Rr]esearch", "study science energy")
    

    Then use paste0 to collapse them into a single string of alternative patterns and use str_extract_all on that string:

    str_extract_all(text, paste0(multiwords, collapse = "|"))
    [[1]]
    [1] "Environmental Research" "Environmental Research" "Environmental Research" "study science energy"
    

    To get the frequencies of the multiwords you can use table:

    table(str_extract_all(text, paste0(multiwords, collapse = "|")))
    
    Environmental Research   study science energy 
                         3                      1
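The pattern-based counting above can also be sketched in base R, without stringr. This is a minimal sketch rather than part of the original answer: the helper name `count_phrases` and the extra phrase "climate policy" are hypothetical and used only for illustration. It assumes the phrases are plain words (they are still interpreted as regular expressions), matches case-insensitively, and reports 0 for phrases that never occur, which `table` on the extracted matches would omit.

```r
# Hypothetical helper (not from the original answer): count how often each
# multi-word phrase occurs in a text, ignoring case, using base R only.
count_phrases <- function(text, phrases) {
  # readLines() returns one element per line, so join them into one string
  text <- paste(text, collapse = " ")
  vapply(phrases, function(p) {
    # \\b word boundaries keep "research" from matching inside "researcher"
    pattern <- paste0("\\b", p, "\\b")
    hits <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)[[1]]
    # gregexpr() returns -1 when the pattern does not match at all
    if (hits[1] == -1L) 0L else length(hits)
  }, integer(1))
}

text <- "Environmental Research Environmental Research Environmental Research study science energy, economics, agriculture, ecology, and biology"

# "climate policy" is an illustrative phrase that is absent from the text
counts <- count_phrases(text, c("environmental research",
                                "study science energy",
                                "climate policy"))
print(counts)
```

The result is a named integer vector, so a single count can be read off with, for example, `counts[["environmental research"]]`.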