Tags: r, machine-learning, statistics, nlp, quanteda

Language-based processing in R: Selecting features in a dfm with a certain pointwise mutual information (PMI) value


I would like to keep only those 2-3 word phrases (i.e. features) in my dfm that have a PMI value greater than 3 times the number of words in the phrase*.

PMI is hereby defined as: pmi(phrase) = log(p(phrase) / Product(p(word)))

with p(phrase): the probability of the phrase based on its relative frequency, and Product(p(word)): the product of the probabilities of each word in the phrase.
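
As a toy illustration of this definition (made-up counts, not from my data): suppose the phrase "i_love" occurs 3 times in a corpus of 20 word tokens, "i" occurs 4 times and "love" occurs 3 times.

p_phrase <- 3 / 20                  # p(phrase)
p_words  <- (4 / 20) * (3 / 20)     # Product(p(word))
log(p_phrase / p_words)             # PMI = log(0.15 / 0.03) = log(5), about 1.61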

So far I have used the following code; however, the PMI values do not seem to be correct, and I am not able to find the issue:

library(quanteda)

#creating dummy data
id <- c(1:5)
text <- c("positiveemoticon my name is positiveemoticon positiveemoticon i love you", "hello dont", "i love you", "i love you", "happy birthday")
ids_text_clean_test <- data.frame(id, text)
ids_text_clean_test$id <- as.character(ids_text_clean_test$id)
ids_text_clean_test$text <- as.character(ids_text_clean_test$text)

test_corpus <- corpus(ids_text_clean_test[["text"]], docnames = ids_text_clean_test[["id"]])

tokens_all_test <- tokens(test_corpus, remove_punct = TRUE)

## Create a document-feature matrix(dfm)
doc_phrases_matrix_test <- dfm(tokens_all_test, ngrams = 2:3) #extracting two- and three-word phrases
doc_phrases_matrix_test

# calculating the pointwise mutual information for each phrase to identify phrases that occur at rates much higher than chance
tcmrs = Matrix::rowSums(doc_phrases_matrix_test) #number of words per user
tcmcs = Matrix::colSums(doc_phrases_matrix_test) #counts of each phrase
N = sum(tcmrs) #number of total words used 
colp = tcmcs/N #proportion of the phrases by total phrases
rowp = tcmrs/N #proportion of each users' words used by total words used
pp = doc_phrases_matrix_test@p + 1
ip = doc_phrases_matrix_test@i + 1
tmpx = rep(0,length(doc_phrases_matrix_test@x)) # new values go here, just a numeric vector
# iterate through sparse matrix:
for (i in 1:(length(doc_phrases_matrix_test@p) - 1) ) {
  ind = pp[i]:(pp[i + 1] - 1)
  not0 = ip[ind]
  icol = doc_phrases_matrix_test@x[ind]
  tmp = log( (icol/N) / (rowp[not0] * colp[i] )) # PMI
  tmpx[ind] = tmp
}

doc_phrases_matrix_test@x = tmpx
doc_phrases_matrix_test

I believe the PMI should not vary for the same phrase across users, but I thought it would be easier to apply the PMI to the dfm directly, so that it is easier to subset it based on the features' PMI. A corpus-level, per-feature version of what I mean is sketched below.
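
Only a sketch of what I have in mind (I am not sure this is the idiomatic quanteda way, and I am assuming that p(phrase) and p(word) are both taken relative to the total number of word tokens N):

# sketch: one PMI value per feature, computed from corpus-level counts
ngram_dfm   <- dfm(tokens_all_test, ngrams = 2:3) # fresh 2-3 word phrase counts
unigram_dfm <- dfm(tokens_all_test)               # single-word counts

ngram_counts <- colSums(ngram_dfm)  # count of each phrase in the whole corpus
word_counts  <- colSums(unigram_dfm)
N <- sum(word_counts)               # total number of word tokens

pmi_per_feature <- sapply(names(ngram_counts), function(feat) {
  words <- unlist(strsplit(feat, "_", fixed = TRUE)) # quanteda joins ngrams with "_"
  log((ngram_counts[[feat]] / N) / prod(word_counts[words] / N))
})

# keep only phrases whose PMI exceeds 3 times the number of words in the phrase
n_words <- lengths(strsplit(names(ngram_counts), "_", fixed = TRUE))
keep    <- names(ngram_counts)[pmi_per_feature > 3 * n_words]
ngram_dfm_sel <- dfm_select(ngram_dfm, pattern = keep)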

An alternative approach I tried is to apply the PMI to the features directly:

test_pmi <- textstat_keyness(doc_phrases_matrix_test,  measure =  "pmi",
                             sort = TRUE)
test_pmi

However, firstly, I am getting a warning here that NaNs were produced, and secondly, I don't understand the PMI values (e.g. why are there negative values?).

Does anyone have a better idea how to extract features based on their PMI values as defined above?

Any hint is highly appreciated :)

*following Park et al. (2015)


Solution

  • You can use the following R code, which uses the udpipe R package, to get what you are asking for. Below is an example on a tokenised data.frame that is part of udpipe.

    library(udpipe) 
    data(brussels_reviews_anno, package = "udpipe") 
    x <- subset(brussels_reviews_anno, language %in% "fr") 
    
    ## find keywords with PMI > 3 
    keyw <- keywords_collocation(x, term = "lemma", 
                                 group = c("doc_id", "sentence_id"), ngram_max = 3, n_min = 10) 
    keyw <- subset(keyw, pmi > 3) 
    
    ## recodes to keywords 
    x$term <- txt_recode_ngram(x$lemma, compound = keyw$keyword, ngram = keyw$ngram) 
    ## create DTM 
    dtm <- document_term_frequencies(x = x$term, document = x$doc_id) 
    dtm <- document_term_matrix(dtm) 
    

    If you want to get a dataset in a structure similar to x, just use udpipe(text, "english") or any language of your choice. If you want to use quanteda for tokenisation, you can still get it into a nicely enriched data.frame - an example of this is given here and here. Look at the help of the udpipe R package; it has many vignettes (?udpipe).
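
    For illustration, a minimal sketch of annotating raw text with udpipe() to get a data.frame in the same structure as x (the texts and document ids below are made up; the English model is downloaded on first use):

    library(udpipe)
    txt <- data.frame(doc_id = c("doc1", "doc2"),
                      text   = c("I love New York in June.", "Happy birthday to you."),
                      stringsAsFactors = FALSE)
    anno <- udpipe(txt, "english")   # one row per token, with lemma, upos, dep_rel, ...
    head(anno[, c("doc_id", "sentence_id", "token", "lemma", "upos", "dep_rel")])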

    Note that while PMI is useful, it is often even more useful to use the dependency parsing output of the udpipe R package. If you look at the dep_rel field you will find categories which identify multi-word expressions (e.g. the dep_rel values fixed/flat/compound mark multi-word expressions as defined at http://universaldependencies.org/u/dep/index.html); you could also use these to put such expressions in your document/term matrix.
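
    A rough sketch of that idea, assuming anno is the output of udpipe(text, "english") (so it has doc_id, sentence_id, token_id, head_token_id and dep_rel columns); the object names here are only illustrative:

    library(udpipe)
    ## tokens whose relation to their head marks a multi-word expression
    mwe <- subset(anno, dep_rel %in% c("fixed", "flat", "compound"))

    ## look up the head token within the same document and sentence
    heads <- anno[, c("doc_id", "sentence_id", "token_id", "token")]
    names(heads) <- c("doc_id", "sentence_id", "head_token_id", "head_token")
    mwe <- merge(mwe, heads, by = c("doc_id", "sentence_id", "head_token_id"))

    ## glue head and dependent together into one term, e.g. "New York"
    mwe$term <- paste(mwe$head_token, mwe$token, sep = " ")

    ## count the multi-word terms per document and turn them into a document/term matrix
    dtm_mwe <- document_term_frequencies(mwe[, c("doc_id", "term")])
    dtm_mwe <- document_term_matrix(dtm_mwe)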