r, text-mining, pos-tagger

Extracting the POS tags in R using


In my dataset, I am trying to create variables containing the number of nouns, verbs, and adjectives for each observation. Using the openNLP package, I have managed to get this far:

library(NLP)      # provides as.String() and annotate()
library(openNLP)  # provides the Maxent_* annotators

s <- paste(c("Pierre Vinken, 61 years old, will join the board as a ",
             "nonexecutive director Nov. 29.\n",
             "Mr. Vinken is chairman of Elsevier N.V., ",
             "the Dutch publishing group."),
           collapse = "")
s <- as.String(s)
s

# Sentence and word annotations must exist before POS tagging
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))

# Add POS tags to the word annotations
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
pos_tag_annotator
a3 <- annotate(s, pos_tag_annotator, a2)
a3

# Keep only the word-level annotations
a3w <- subset(a3, type == "word")
a3w

This gives me the output:

id type     start end features
1 sentence     1  84 constituents=<<integer,18>>
2 sentence    86 153 constituents=<<integer,13>>
3 word         1   6 POS=NNP
4 word         8  13 POS=NNP
5 word        14  14 POS=,

And so on.

My question is: how do I extract, for example, the number of nouns per observation so that I can use it in further analysis?

Thanks!


Solution

  • I don't use openNLP myself; I use other packages for POS tagging. If someone can answer this for openNLP, that would be great.

    In the meantime, here is a solution using udpipe, which you might find useful.

    s <- paste(c("Pierre Vinken, 61 years old, will join the board as a ",
                 "nonexecutive director Nov. 29.\n",
                 "Mr. Vinken is chairman of Elsevier N.V., ",
                 "the Dutch publishing group."),
               collapse = "")
    
    library(udpipe)
    
    # Load the model from the working directory if present; otherwise download it first
    if (file.exists("english-ud-2.0-170801.udpipe")) {
      ud_model <- udpipe_load_model(file = "english-ud-2.0-170801.udpipe")
    } else {
      ud_model <- udpipe_download_model(language = "english")
      ud_model <- udpipe_load_model(ud_model$file_model)
    }
    
    x <- udpipe_annotate(ud_model, s)
    x <- as.data.frame(x)
    table(x$upos)
    
      ADJ   ADP   AUX   DET  NOUN   NUM PROPN PUNCT  VERB 
        2     2     2     3     6     2     8     5     1 
    

    Edit: counts per sentence:

    table(x$sentence_id, x$upos)
        ADJ ADP AUX DET NOUN NUM PROPN PUNCT VERB
      1   2   1   1   2    3   2     3     3    1
      2   0   1   1   1    3   0     5     2    0
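
To turn a table like the one above into per-observation variables, one option is to keep only the tags of interest and reshape the counts into a data frame. The following is a minimal base-R sketch, not part of udpipe: `count_pos()` is a hypothetical helper, and the small data frame is a made-up stand-in that only mimics the `doc_id`, `sentence_id` and `upos` columns of the annotated data frame `x`.

```r
# Toy stand-in for as.data.frame(udpipe_annotate(...)); values are made up
# for illustration and only the columns used below are included.
x <- data.frame(
  doc_id      = c("doc1", "doc1", "doc1", "doc1", "doc2", "doc2", "doc2"),
  sentence_id = c(1, 1, 2, 2, 1, 1, 1),
  upos        = c("NOUN", "VERB", "NOUN", "PUNCT", "ADJ", "NOUN", "PUNCT"),
  stringsAsFactors = FALSE
)

# count_pos() is a hypothetical helper, not a udpipe function.
count_pos <- function(x, by = "doc_id", keep = c("NOUN", "VERB", "ADJ")) {
  # Fixing the factor levels keeps one column per tag in `keep`, even when a
  # tag never occurs; tags outside `keep` become NA and are not counted.
  tags <- factor(x$upos, levels = keep)
  counts <- table(x[[by]], tags)
  out <- as.data.frame.matrix(counts)
  # Move the grouping values from the row names into a regular column.
  out[[by]] <- rownames(out)
  rownames(out) <- NULL
  out[, c(by, keep)]
}

pos_counts <- count_pos(x, by = "doc_id")
print(pos_counts)
```

The result has one row per `doc_id` and one count column per tag, so it can be merged back onto the original dataset by that key. Passing `by = "sentence_id"` gives per-sentence counts instead, which is only meaningful within a single document because sentence ids restart in each document.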
    

    When you create a data.frame from x after the annotation, you have access to doc_id, paragraph_id, sentence_id, and so on, so you can compute a whole range of statistics per document or per sentence. The udpipe vignettes give a good overview of what is possible.
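
For readers who want to stay with openNLP as in the question, the same idea can be sketched on Penn Treebank tags, where noun tags start with `NN`, verb tags with `VB`, and adjective tags with `JJ`. This is a sketch under assumptions: `collapse_penn()` is a hypothetical helper, and the tag vector is typed in by hand for illustration so the code runs without openNLP. With the question's objects, the vector could presumably be obtained with `sapply(a3w$features, `[[`, "POS")`.

```r
# Illustrative Penn Treebank tags, typed in by hand (an assumption, not
# output copied from openNLP). With the question's objects one would use:
#   tags <- sapply(a3w$features, `[[`, "POS")
tags <- c("NNP", "NNP", ",", "CD", "NNS", "JJ", ",", "MD", "VB", "DT", "NN")

# collapse_penn() is a hypothetical helper: it maps fine-grained tags to three
# coarse classes by prefix and labels everything else "other".
collapse_penn <- function(tags) {
  coarse <- rep("other", length(tags))
  coarse[startsWith(tags, "NN")] <- "noun"       # NN, NNS, NNP, NNPS
  coarse[startsWith(tags, "VB")] <- "verb"       # VB, VBD, VBG, VBN, VBP, VBZ
  coarse[startsWith(tags, "JJ")] <- "adjective"  # JJ, JJR, JJS
  factor(coarse, levels = c("noun", "verb", "adjective", "other"))
}

penn_counts <- table(collapse_penn(tags))
print(penn_counts)
```

To get one row per observation, the annotation and this counting step would be applied to each text separately, for example inside `lapply()`, and the resulting count vectors bound together. Note that modal verbs are tagged `MD` and fall under "other" here; whether they should count as verbs depends on the analysis.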