I am working with a number of texts using the quanteda package. The texts contain tags, some with unique values such as URLs. I want to remove not only the tags but also everything inside them.
Example:
<oa>
</oa>
<URL: http://in.answers.yahoo.com/question/index;_ylt=Ap2wvXm2aeRQKHO.HeDgTfneQHRG;_ylv=3?qid=1006042400700>
<q>
<ad>
</ad>
I'm not sure how to remove them while working with the quanteda package. It seems to me that the dfm function would be the place to do it, but I don't think stopwords will work because of the unique URLs. I can use the following gsub call with a regex to successfully target the tags I want to remove:
x <- gsub("<.*?>", "", y)
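For example, on a made-up line (not from my actual files) it strips each tag together with everything between the angle brackets:
y <- "Intro <oa> body text </oa> <URL: http://example.com/page> end"
gsub("<.*?>", "", y)
# [1] "Intro  body text   end"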
I've gone through the dfm documentation and have tried a few things with the remove and valuetype arguments, but perhaps I don't understand the documentation very well.
Also, as shown by the answer in this question, I tried the dfm_select function, but no dice there either.
Here is my code:
library(readtext)
library(quanteda)
#list the .txt files
data_dir <- list.files(pattern = "*.txt", recursive = TRUE, full.names = TRUE)
#create corpus
micusp_corpus <- corpus(readtext(data_dir))
#add docvar 'Region'
docvars(micusp_corpus, "Region") <- gsub("(\\w{6})\\..*?$", "", docnames(micusp_corpus))
#create document feature matrix
micusp_dfm <- dfm(micusp_corpus, groups = "Region", remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
#try to remove tags
micusp_dfm <- dfm_select(micusp_dfm, "<.*?>", selection = "remove", valuetype = "regex")
#show the top features (note the appearance of the tag content "oa")
textstat_frequency(micusp_dfm, n=10)
While your question does not provide a reproducible example, I think I can help. You want to clean the texts that go into your corpus before you reach the dfm construction stage. Replace the #create corpus line with this:
# read texts, remove tags, and create the corpus
tmp <- readtext(data_dir)
tmp$text <- gsub("<.*?>", "", tmp$text)
micusp_corpus <- corpus(tmp)
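In case it helps, here is a minimal self-contained sketch of the same idea, using made-up texts in place of your files (the document names and text values are assumptions, not your data):
library(quanteda)
# hypothetical stand-ins for the contents of the .txt files
txts <- c(doc1 = "<oa> Writing about cells </oa> cells divide rapidly",
          doc2 = "<URL: http://example.com/q?id=1> cells and tissue <q>")
# strip the tags and everything inside them before building the corpus
txts <- gsub("<.*?>", "", txts)
demo_corpus <- corpus(txts)
demo_dfm <- dfm(demo_corpus, remove_punct = TRUE)
# the tag contents ("oa", the URL, "q") no longer appear among the features
topfeatures(demo_dfm)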