I would like to keep only those 2-3 word phrases (i.e. features) in my dfm whose PMI value is greater than 3x the number of words in the phrase*.
PMI is defined here as:
pmi(phrase) = log(p(phrase) / Product(p(word)))
where
p(phrase) is the probability of the phrase, based on its relative frequency, and
Product(p(word)) is the product of the probabilities of each word in the phrase.
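To make the definition concrete, here is a minimal base-R sketch that computes it on a toy token vector. The data and the helper names (`rel_freq`, `make_ngrams`, `phrase_pmi`) are hypothetical and not part of quanteda. As a simplification, it treats all tokens as one stream, so n-grams may span document boundaries.

```r
# Toy token stream (made-up data, treated as a single document)
tokens <- c("i", "love", "you", "hello", "dont", "i", "love", "you",
            "happy", "birthday", "i", "love", "you")

# Relative frequencies as a plain named numeric vector
rel_freq <- function(x) {
  tab <- table(x)
  setNames(as.numeric(tab) / length(x), names(tab))
}

# All n-grams of size n, with words joined by "_"
make_ngrams <- function(x, n) {
  if (length(x) < n) return(character(0))
  starts <- seq_len(length(x) - n + 1)
  vapply(starts, function(s) paste(x[s:(s + n - 1)], collapse = "_"),
         character(1))
}

# pmi(phrase) = log(p(phrase) / Product(p(word))) for every distinct n-gram
phrase_pmi <- function(x, n) {
  word_p <- rel_freq(x)
  gram_p <- rel_freq(make_ngrams(x, n))
  vapply(names(gram_p), function(g) {
    words <- strsplit(g, "_", fixed = TRUE)[[1]]
    log(gram_p[[g]] / prod(word_p[words]))
  }, numeric(1))
}

pmi3 <- phrase_pmi(tokens, 3)
pmi3[["i_love_you"]]
```

With this definition there is one PMI value per phrase, and it is negative whenever a phrase occurs less often than the product of its word probabilities would predict.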
Thus far I have used the following code. However, the PMI values do not seem to be correct, and I am not able to find the issue:
library(quanteda)
# creating dummy data
id <- c(1:5)
text <- c("positiveemoticon my name is positiveemoticon positiveemoticon i love you", "hello dont", "i love you", "i love you", "happy birthday")
ids_text_clean_test <- data.frame(id, text)
ids_text_clean_test$id <- as.character(ids_text_clean_test$id)
ids_text_clean_test$text <- as.character(ids_text_clean_test$text)
test_corpus <- corpus(ids_text_clean_test[["text"]], docnames = ids_text_clean_test[["id"]])
tokens_all_test <- tokens(test_corpus, remove_punct = TRUE)
## Create a document-feature matrix(dfm)
doc_phrases_matrix_test <- dfm(tokens_ngrams(tokens_all_test, n = 2:3)) # extracting two- and three-word phrases
doc_phrases_matrix_test
# calculating the pointwise mutual information for each phrase to identify phrases that occur at rates much higher than chance
tcmrs = Matrix::rowSums(doc_phrases_matrix_test) #number of words per user
tcmcs = Matrix::colSums(doc_phrases_matrix_test) #counts of each phrase
N = sum(tcmrs) #number of total words used
colp = tcmcs/N #proportion of the phrases by total phrases
rowp = tcmrs/N #proportion of each users' words used by total words used
pp = doc_phrases_matrix_test@p + 1
ip = doc_phrases_matrix_test@i + 1
tmpx = rep(0,length(doc_phrases_matrix_test@x)) # new values go here, just a numeric vector
# iterate through sparse matrix:
for (i in 1:(length(doc_phrases_matrix_test@p) - 1) ) {
ind = pp[i]:(pp[i + 1] - 1)
not0 = ip[ind]
icol = doc_phrases_matrix_test@x[ind]
tmp = log( (icol/N) / (rowp[not0] * colp[i] )) # PMI
tmpx[ind] = tmp
}
doc_phrases_matrix_test@x = tmpx
doc_phrases_matrix_test
I believe the PMI of a phrase should not vary by user, but I thought that applying the PMI to the dfm directly would make it easier to subset the dfm based on each feature's PMI.
An alternative approach I tried is to apply the PMI to the features directly:
test_pmi <- textstat_keyness(doc_phrases_matrix_test, measure = "pmi",
sort = TRUE)
test_pmi
However, firstly, I am getting a warning that NaNs were produced, and secondly, I don't understand the PMI values (e.g. why are there negative values?).
Does anyone have a better idea of how to extract features based on their PMI values as defined above?
Any hint is highly appreciated :)
*following Park et al. (2015)
You can use the following R code, which uses the udpipe R package, to get what you are asking for. The example works on a tokenised data.frame that is part of udpipe:
library(udpipe)
data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language %in% "fr")
## find keywords with PMI > 3
keyw <- keywords_collocation(x, term = "lemma",
group = c("doc_id", "sentence_id"), ngram_max = 3, n_min = 10)
keyw <- subset(keyw, pmi > 3)
## recodes to keywords
x$term <- txt_recode_ngram(x$lemma, compound = keyw$keyword, ngram = keyw$ngram)
## create DTM
dtm <- document_term_frequencies(x = x$term, document = x$doc_id)
dtm <- document_term_matrix(dtm)
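The filter above uses a fixed cut-off of 3. As a hedged sketch, the length-dependent threshold from the question (PMI greater than 3x the number of words in the phrase) could be applied as follows. The data frame here is made up and only mimics the keyword, ngram and pmi columns used above. It also assumes that the PMI values are on the scale (log base) intended in the question, which should be checked.

```r
# Made-up stand-in for the collocation output (only the columns used here)
keyw_demo <- data.frame(
  keyword = c("i love", "love you", "i love you", "happy birthday"),
  ngram   = c(2L, 2L, 3L, 2L),
  pmi     = c(6.5, 4.2, 9.8, 7.1),
  stringsAsFactors = FALSE
)

# Keep phrases whose PMI exceeds 3 times their number of words
keyw_kept <- subset(keyw_demo, pmi > 3 * ngram)
keyw_kept$keyword
```

The resulting subset can then be used for the recoding step in the same way as keyw.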
If you want a dataset with a structure similar to x, just use udpipe(text, "english"), or any language of your choice. If you want to use quanteda for tokenisation, you can still get it into a nicer, enriched data.frame; examples of this are given here and here. Have a look at the help of the udpipe R package, which has many vignettes (?udpipe).
Note that while PMI is useful, it is much more useful to use the dependency parsing output of the udpipe R package. If you look at the dep_rel field, you will find categories that identify multi-word expressions (e.g. the dep_rel values fixed/flat/compound are multi-word expressions as defined at http://universaldependencies.org/u/dep/index.html). You could also use these to put them in your document-term matrix.
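A minimal sketch of that selection, assuming an annotation data.frame with doc_id, token and dep_rel columns. The rows below are made up for illustration and are not real parser output.

```r
# Made-up annotation rows (not real udpipe output)
anno <- data.frame(
  doc_id  = c("d1", "d1", "d1", "d1", "d2", "d2"),
  token   = c("New", "York", "is", "big", "ice", "cream"),
  dep_rel = c("nsubj", "flat", "cop", "root", "compound", "root"),
  stringsAsFactors = FALSE
)

# Relations that mark multi-word expressions
mwe_relations <- c("fixed", "flat", "compound")

# Strip a possible subtype (e.g. "flat:name") before matching
base_rel <- sub(":.*$", "", anno$dep_rel)

# Tokens that are attached to their head through a multi-word relation
mwe_tokens <- anno[base_rel %in% mwe_relations, ]
mwe_tokens
```

Each selected token would still need to be joined with its head token to form the full expression before it goes into the matrix.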