
Count POS Tags by column


I am trying to count all Part-Of-Speech (POS) tags in each row and sum them up.

So far I have produced two outputs:

1) The/DT question/NN was/VBD ,/, what/WP are/VBP you/PRP going/VBG to/TO cut/VB ?/.

2) c("DT", "NN", "VBD", ",", "WP", "VBP", "PRP", "VBG", "TO", "VB", ".")

In this particular example the desired output is:

        DT  NN  VBD  WP  VBP  PRP   VBG   TO   VB
1 doc   1   1    1   1    1    1     1     1    1

But since I want to do this for the whole column in the dataframe, I also want to see 0 values in the columns corresponding to POS tags that were not used in a given sentence.

Example:

1 doc = "The/DT question/NN was/VBD ,/, what/WP are/VBP you/PRP going/VBG to/TO cut/VB ?/" 

2 doc = "Response/NN ?/."

Output:

        DT  NN  VBD  WP  VBP  PRP   VBG   TO   VB
1 doc   1   1    1   1    1    1     1     1    1
2 doc   0   1    0   0    0    0     0     0    0

What I have done so far:

library(stringr)

# Splitting into sentences based on carriage returns
s <- unlist(str_split(df$sentence, "\n"))

library(NLP)
library(openNLP)

tagPOS <- function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}

result <- lapply(s, tagPOS)
result <- as.data.frame(do.call(rbind, result))

That is how I reached the output described at the beginning.

I have tried to count occurrences like this:

occurrences <- as.data.frame(table(unlist(result$POStags)))

But this counts occurrences across the whole dataframe. I need to count occurrences per row and add the counts as new columns to the existing dataframe.
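One way to count per row rather than over the whole dataframe is to tabulate each tag vector against a fixed set of factor levels, so tags that never occur in a sentence still come out as 0. A minimal base-R sketch (`tag_list` is a stand-in for the `result$POStags` list column above, not tested against the openNLP output):

```r
# Stand-in for result$POStags: one character vector of tags per document
tag_list <- list(
  doc1 = c("DT", "NN", "VBD", ",", "WP", "VBP", "PRP", "VBG", "TO", "VB", "."),
  doc2 = c("NN", ".")
)

# Fix the levels to the union of all tags, so table() emits explicit zeros
all_tags <- sort(unique(unlist(tag_list)))
occurrences <- t(sapply(tag_list, function(tags)
  table(factor(tags, levels = all_tags))))
occurrences  # one row per document, one column per POS tag
```

The resulting matrix can then be cbind-ed onto the original dataframe.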

Can anyone help me please? :(


Solution

  • using tm is relatively pain-free:

    dummy data

    require(tm)
    df <- data.frame(ID = c("doc1", "doc2"),
                     tags = c("NN",
                              paste("DT", "NN", "VBD", ",", "WP", "VBP", "PRP", "VBG", "TO", "VB", ".")))
    

    make corpus and DocumentTermMatrix:

    corpus <- Corpus(VectorSource(df$tags))
    #default minimum wordlength is 3, so make sure you change this
    dtm <- DocumentTermMatrix(corpus, control= list(wordLengths=c(1,Inf)))
    
    #see what you've done
    inspect(dtm)
    
    <<DocumentTermMatrix (documents: 2, terms: 9)>>
    Non-/sparse entries: 10/8
    Sparsity           : 44%
    Maximal term length: 3
    Weighting          : term frequency (tf)
    Sample             :
        Terms
    Docs dt nn prp to vb vbd vbg vbp wp
       1  0  1   0  0  0   0   0   0  0
       2  1  1   1  1  1   1   1   1  1
    

    eta: if you dislike working with a dtm, you can coerce it to a dataframe:

    as.data.frame(as.matrix(dtm))
    
      nn dt prp to vb vbd vbg vbp wp
    1  1  0   0  0  0   0   0   0  0
    2  1  1   1  1  1   1   1   1  1
    

    eta2: Corpus builds a corpus from the df$tags column only, and VectorSource treats each row of the data as one document, so the order of rows in the dataframe df and the order of documents in the DocumentTermMatrix are the same. That means I can cbind df$ID onto the output dataframe. I do this using dplyr because I think it gives the most readable code (read %>% as "and then"):

    require(dplyr)
    result <- as.data.frame(as.matrix(dtm)) %>%
              bind_cols(ID = df$ID)
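If you would rather avoid the dplyr dependency, plain cbind gives the same join (a sketch; `m` and `ids` stand in for `as.matrix(dtm)` and `df$ID` from the example above):

```r
# Stand-ins for as.matrix(dtm) and df$ID (two docs, two terms, illustrative)
m <- matrix(c(0, 1, 1, 1), nrow = 2,
            dimnames = list(NULL, c("dt", "nn")))
ids <- c("doc1", "doc2")

# Prepend the document IDs to the term-count columns
result <- cbind(ID = ids, as.data.frame(m))
result
```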