
Count the number of verbs for each speech in a data frame in R


I have the following data frame:

str(data)
'data.frame':   255 obs. of  3 variables:
$ Group      : Factor w/ 255 levels "AlzGroup1","AlzGroup10",..: 1 112 179 190 201 212 223 234 245 2 ...
$ Gender     : int  1 1 0 0 0 0 0 1 0 0 ...
$ Description: Factor w/ 255 levels "A boy's on the uh falling off the stool picking up cookies . The girl's reaching up for it . The girl the lady "| __truncated__,..: 63 69 38 134 111 242 196 85 84 233 ...

The Description column holds 255 speeches, and I want to add a column to my data frame containing the number of verbs in each speech. I know how to count verbs, but the following code gives me the total number of verbs across the whole Description column:

library(NLP)
library(tm)
library(openNLP)
NumOfVerbs <- sapply(strsplit(as.character(tagPOS(data$Description)), "[[:punct:]]*/VB.?"), function(x) { res <- sub("(^.*\\s)(\\w+$)", "\\2", x); res[!grepl("\\s", res)] })

Does anyone know how I can get the number of verbs in each speech?

Thanks for any help!

Elahe


Solution

  • Assuming you are using a function similar to this one (from the question "could not find function tagPOS"):

    tagPOS <-  function(x, ...) {
      s <- as.String(x)
      word_token_annotator <- Maxent_Word_Token_Annotator()
      a2 <- Annotation(1L, "sentence", 1L, nchar(s))
      a2 <- annotate(s, word_token_annotator, a2)
      a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
      a3w <- a3[a3$type == "word"]
      POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
      POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
      list(POStagged = POStagged, POStags = POStags)
    }
    

    Create a function that counts the POS tags containing the letters 'VB':

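    count_verbs <- function(x) {
      # POS tags for the text in x, one tag per token
      pos_tags <- tagPOS(x)$POStags
      # Count the tags that contain "VB" (the verb tags)
      sum(grepl("VB", pos_tags))
    }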
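
    To make the matching step concrete, here is a small sketch in base R. The tag vector is made up for illustration; it only assumes that tagPOS(x)$POStags returns Penn Treebank style tags, where the verb tags are VB, VBD, VBG, VBN, VBP and VBZ.

    ```r
    # Hypothetical tag vector, standing in for tagPOS(x)$POStags on one speech
    pos_tags <- c("DT", "NN", "VBZ", "VBG", "IN", "DT", "NN", "PRP", "VBD", "RB")

    # Substring match, as in count_verbs()
    n_substring <- sum(grepl("VB", pos_tags))

    # Anchored match: only tags that start with "VB"
    n_anchored <- sum(grepl("^VB", pos_tags))

    # Breakdown of the matched verb tags by type
    verb_breakdown <- table(pos_tags[grepl("^VB", pos_tags)])
    ```

    Both patterns give the same count for this vector, since every Penn Treebank tag containing "VB" also starts with it. Note that modal verbs are tagged MD in that tag set, so neither pattern counts them.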
    

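    Then use dplyr to group by Group and summarise with count_verbs(). Each Group value occurs once in this data (255 levels for 255 rows), so this gives one count per speech: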

    library(dplyr)
    data %>% 
      group_by(Group) %>%
      summarise(num_verbs = count_verbs(Description))
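
Since the question asks for a new column on the data frame, a base R variant may also be useful. The following is only a sketch under stated assumptions: the toy data, `toy_pos_tags()` and `count_verbs_with()` are hypothetical names introduced here so the example runs without openNLP, and the hand-written lookup is not a real tagger.

```r
# Toy data in the shape of the question's data frame (contents are made up)
data <- data.frame(
  Group = c("AlzGroup1", "AlzGroup2", "AlzGroup3"),
  Description = factor(c(
    "The boy is falling off the stool .",
    "The girl reaches up .",
    "Cookies ."
  )),
  stringsAsFactors = FALSE
)

# Stand-in for tagPOS(x)$POStags: a tiny hand-written lookup, NOT a real tagger
toy_tags <- c(is = "VBZ", falling = "VBG", reaches = "VBZ")
toy_pos_tags <- function(x) {
  words <- strsplit(tolower(as.character(x)), "\\s+")[[1]]
  tags <- toy_tags[words]
  tags[is.na(tags)] <- "XX"  # placeholder tag for every other token
  unname(tags)
}

# Same idea as count_verbs(), with the tagger passed in as an argument
count_verbs_with <- function(x, tagger) {
  sum(grepl("^VB", tagger(x)))
}

# One call per row, so each speech gets its own count
data$num_verbs <- vapply(
  as.character(data$Description),
  count_verbs_with,
  integer(1),
  tagger = toy_pos_tags,
  USE.NAMES = FALSE
)
```

With the real tagger, the same call should work by passing `tagger = function(x) tagPOS(x)$POStags`. The point is that the tagger is called once per row, whereas passing the whole Description column in a single call tags all speeches together and yields one total.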