
Grepl group of strings and count frequency of all using R


I have a CSV file with 50k rows of tweets in a column named text (the tweets consist of sentences, phrases, etc.). I'm trying to count the frequency of several words in that column. Is there an easier way to do it than what I'm doing below?

# Reading my file
tweets <- read.csv('coffee.csv', header=TRUE)


# Doing a grepl per word (This is hard because I need to look for many words one by one)
coffee <- grepl("coffee", tweets$text, ignore.case=TRUE)
mugs   <- grepl("mugs", tweets$text, ignore.case=TRUE)


# Calculate the % of times among all tweets (This is hard because I need to calculate one by one)

sum(coffee) / nrow(tweets)
sum(mugs) / nrow(tweets)

Expected output (assuming I have more than two words above):

Word   Freq
coffee  50
mugs    40
cup     64
pen     12

Solution

  • You can put the words whose frequency and percentage you want into a vector and use sapply to compute both for each word.

    words <- c('coffee', 'mugs')
    
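    # Wrap each word in \\b so only whole words match, then test every tweet
    result <- data.frame(words, t(sapply(paste0('\\b', words, '\\b'), function(x) {
      tmp <- grepl(x, tweets$text)  # TRUE for each tweet containing the word
      c(perc = mean(tmp) * 100,
        Freq = sum(tmp))
    })), row.names = NULL)
    result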
    
    #   words     perc Freq
    #1 coffee 33.33333    1
    #2   mugs 66.66667    2
    

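    sapply works like a for loop: it iterates over each pattern built from words. For each pattern, grepl returns a logical vector, stored in tmp, that is TRUE wherever an element of tweets$text contains the word. sum(tmp) therefore gives the number of tweets containing the word (not the total number of occurrences), and mean(tmp) * 100 gives the percentage of tweets. The word boundaries (\\b) around each word force whole-word matches, so 'coffee' does not match 'coffees'. Unlike the code in the question, this grepl call is case-sensitive; pass ignore.case = TRUE to match regardless of case.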
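
    To make the effect of the word boundary and of ignore.case concrete, here is a small sketch. The three strings in x are invented for illustration and are not part of the original data.

    ```r
    # Toy strings, made up for this illustration only
    x <- c("I love coffee", "Two coffees please", "COFFEE time")

    # Plain substring match: also hits "coffees", misses upper-case "COFFEE"
    plain <- grepl("coffee", x)

    # Whole-word match via \\b: no longer matches "coffees"
    whole <- grepl("\\bcoffee\\b", x)

    # Whole-word and case-insensitive, as in the question's grepl calls
    whole_ci <- grepl("\\bcoffee\\b", x, ignore.case = TRUE)

    print(data.frame(x, plain, whole, whole_ci))
    ```

    Only the last variant treats "COFFEE time" as a match while still skipping "Two coffees please", which is probably what you want for tweets.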

    data

    tweets <- data.frame(text = c('This is text with coffee in it with lot of mugs', 
                                  'This has only mugs', 
                                  'This has nothing'))
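
    If you need this for many word lists, the same idea could be wrapped in a small helper. The sketch below is one possible variant, not the code from the answer above: the name count_words and its arguments are hypothetical, it uses vapply instead of sapply, it matches case-insensitively by default, and it assumes the words are plain alphanumeric strings with no regex metacharacters.

    ```r
    # Hypothetical helper: name and arguments are assumptions for illustration
    count_words <- function(text, words, ignore.case = TRUE) {
      # For each word, a logical vector: TRUE where the tweet contains the whole word
      hits <- vapply(
        words,
        function(w) grepl(paste0("\\b", w, "\\b"), text, ignore.case = ignore.case),
        logical(length(text))
      )
      # Force a tweets-by-words matrix even when there is only one tweet
      hits <- matrix(hits, nrow = length(text), ncol = length(words),
                     dimnames = list(NULL, words))
      data.frame(
        Word = words,
        Freq = colSums(hits),         # number of tweets containing the word
        Perc = colMeans(hits) * 100,  # percentage of tweets containing the word
        row.names = NULL
      )
    }

    # Same sample data as above
    tweets <- data.frame(text = c('This is text with coffee in it with lot of mugs',
                                  'This has only mugs',
                                  'This has nothing'))

    result <- count_words(tweets$text, c('coffee', 'mugs'))
    print(result)
    ```

    The Word and Freq columns follow the layout of the expected output in the question; Perc is kept alongside because the question also computes the share of tweets.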