I want to identify the most frequent n-grams in a collection of academic papers, including n-grams with nested stopwords, but not n-grams with leading or trailing stopwords.
I have about 100 PDF files. I converted them to plain-text files with an Adobe batch command and collected them in a single directory. From there I work in R. (The code is a patchwork because I'm just getting started with text mining.)
My code:
library(tm)
# Make path for sub-dir which contains corpus files
path <- file.path(getwd(), "txt")
# Load corpus files
docs <- Corpus(DirSource(path), readerControl=list(reader=readPlain, language="en"))
# Cleaning -- base functions like tolower must be wrapped in
# content_transformer() so the corpus class is preserved
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
# Merge corpus (Corpus class to character vector, one string per document)
txt <- unlist(lapply(docs, function(d) paste(content(d), collapse = " ")))
# Find trigrams (but I might look for other ngrams as well)
library(quanteda)
myDfm <- dfm(txt, ngrams = 3)
# Remove sparse features
myDfm <- dfm_trim(myDfm, min_count = 5)
# Display top features
topfeatures(myDfm)
#                   as_well_as             of_the_ecosystem                  in_order_to 
#                          603                          543                          458 
#         a_business_ecosystem       the_business_ecosystem strategic_management_journal 
#                          431                          431                          359 
#             in_the_ecosystem        academy_of_management                  the_role_of 
#                          336                          311                          289 
#                the_number_of 
#                          276 
For example, in the sample of top n-grams above, I'd want to keep "academy_of_management", but not "as_well_as" or "the_role_of". I'd like the code to work for any n-gram length (preferably including n < 3, although I understand that in that case it's simpler to just remove stopwords first).
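To make the requirement concrete, here's a quick check I sketched (it assumes the myDfm object from my code above; stopwords("english") comes with quanteda) that flags which of the top features begin or end with a stopword:
# Sketch: flag top features whose first or last token is a stopword
feats <- names(topfeatures(myDfm, 10))
sw <- paste(stopwords("english"), collapse = "|")
bad <- grepl(paste0("^(", sw, ")_"), feats) | grepl(paste0("_(", sw, ")$"), feats)
feats[bad]   # should catch e.g. "as_well_as" and "the_role_of"
feats[!bad]  # should keep e.g. "academy_of_management"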
Here's how to do it in quanteda: use dfm_remove(), where the pattern to remove is a stopword followed by the concatenator character, anchored to the beginning of the feature, plus the concatenator preceded by a stopword, anchored to the end. (Note that for reproducibility, I have used a built-in text object here.)
library("quanteda")
# substitute your own txt here
txt <- data_char_ukimmig2010
(myDfm <- dfm(txt, remove_numbers = TRUE, remove_punct = TRUE, ngrams = 3))
## Document-feature matrix of: 9 documents, 5,518 features (88.5% sparse).
(myDfm2 <- dfm_remove(myDfm,
                      pattern = c(paste0("^", stopwords("english"), "_"),
                                  paste0("_", stopwords("english"), "$")),
                      valuetype = "regex"))
## Document-feature matrix of: 9 documents, 1,763 features (88.6% sparse).
head(featnames(myDfm2))
## [1] "immigration_an_unparalleled" "bnp_can_solve" "solve_at_current"
## [4] "immigration_and_birth" "birth_rates_indigenous" "rates_indigenous_british"
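Since the patterns only anchor at the first and last token, the same dfm_remove() call should work unchanged for any n. A sketch with the same built-in text (for unigrams there is no concatenator, so you would pass the stopword list itself as the pattern):
# Same removal pattern applied to bigrams -- nothing here is trigram-specific
myDfm_bi <- dfm(txt, remove_numbers = TRUE, remove_punct = TRUE, ngrams = 2)
myDfm_bi2 <- dfm_remove(myDfm_bi,
                        pattern = c(paste0("^", stopwords("english"), "_"),
                                    paste0("_", stopwords("english"), "$")),
                        valuetype = "regex")
# For unigrams, a leading or trailing stopword is the stopword itself
myDfm_uni <- dfm(txt, remove_numbers = TRUE, remove_punct = TRUE)
myDfm_uni2 <- dfm_remove(myDfm_uni, pattern = stopwords("english"))
Note that for bigrams this removes every feature with a stopword at either end, which matches your observation that for n < 3 it amounts to removing stopwords first.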
You can read your PDFs directly with the readtext package, which works just fine with quanteda and the code above, so no separate Adobe conversion step is needed:
library("readtext")
txt <- readtext("yourpdfolder/*.pdf") %>% corpus()
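Put together for your case, the whole pipeline from PDFs to filtered trigrams would look roughly like this (a sketch; "yourpdfolder" is a placeholder for your own directory):
library("readtext")
library("quanteda")
# placeholder path -- point the glob at your own PDF directory
txt <- corpus(readtext("yourpdfolder/*.pdf"))
# lowercasing, number and punctuation removal happen inside dfm()
myDfm <- dfm(txt, tolower = TRUE, remove_numbers = TRUE, remove_punct = TRUE, ngrams = 3)
# drop ngrams with a leading or trailing stopword, keep nested ones
myDfm <- dfm_remove(myDfm,
                    pattern = c(paste0("^", stopwords("english"), "_"),
                                paste0("_", stopwords("english"), "$")),
                    valuetype = "regex")
# drop rare ngrams, as in your original code
myDfm <- dfm_trim(myDfm, min_count = 5)
topfeatures(myDfm)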