Tags: r, text-mining, quanteda

Add new document to R corpus to find unique words


I have a corpus of speeches and I would like to identify the words that are unique to one kind of speech.

This is what I did: I extracted two corpora from the larger one, called EUP_control_corpus and IMF_control_corpus in the script below. I collapsed IMF_control_corpus into a single text, which I want to combine with EUP_control_corpus. Then, using tf-idf, I want to find out which terms are unique to the IMF speeches relative to the EUP speeches.

However, I'm stuck at the part of adding to (combining with) a corpus. It seems like this should be very simple, so I don't understand why I can't find anything on it. Is it so simple that no one has asked this question?

I tried converting both into a dfm and then joining them, and I tried turning the single text back into a corpus to join them, but in both cases the single text ended up split into a large number of documents again.

#Create date format
base_corpus$documents$int_date <- 
  as.Date( base_corpus$documents$date,  format = "%d-%m-%Y")
head(as.Date( base_corpus$documents$date,  format = "%d-%m-%Y"))


#Select pre-crisis EUP speeches for control group
EUP_control_corpus<- 
  corpus_subset(base_corpus, country == "European Parliament" & int_date < as.Date( '31-12-2012', format = "%d-%m-%Y"))
head(docnames(EUP_control_corpus), 50)
ndoc(EUP_control_corpus)


#Create dfm out of EUP corpus
EUP_control_dfm <- 
  dfm(EUP_control_corpus, tolower = TRUE, stem = FALSE)
ndoc(EUP_control_dfm)


#Select pre-crisis IMF speeches for control group
IMF_control_corpus<- 
  corpus_subset(base_corpus, country == "International Monetary Fund" & int_date < as.Date( '31-12-2012', format = "%d-%m-%Y"))
head(docnames(IMF_control_corpus), 50)
ndoc(IMF_control_corpus)


#Combine IMF_control_corpus into one text
IMF_control_text<-
  texts(corpus(texts(IMF_control_corpus, groups = "texts")))
IMF_control_dfm<-
  dfm(IMF_control_text)
ndoc(IMF_control_dfm)


#Add IMF_control_text to EUP_control_dfm
plus_dfm<-
  dfm(rbind(EUP_control_dfm, IMF_control_dfm))
ndoc((plus_dfm))


#Add IMF_control_text to EUP_control_corpus/ doesn't work, make text into single text corpus and then add?
total_control_corpus<-
  corpus(EUP_control_corpus, IMF_control_text)
ndoc(total_control_corpus)

I have the idea that the group function in quanteda could be useful for doing this another way, but I decided to post the question first, as I have been searching for a couple of days already.

Thank you for reading this question.


Solution

  • The question does not include a reproducible example, so it is hard to give a definitive answer. Here are some suggestions:

    1. Create a new document variable called control that takes one of two values, IMF or EU. Set it using the same conditions that you previously passed to corpus_subset(). From that corpus, you can easily create a dfm that retains this docvar, or you can use the groups = "control" argument to dfm() to collapse the counts by the values of this variable.

    2. Use docvars(thecorpus, "thevariable") <- newvalue instead of addressing the inner contents of the corpus object (as in base_corpus$documents$int_date). Addressing the internals directly is not stable, since we may change the internal contents of the corpus at any time.
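
The two suggestions above can be sketched roughly as follows. This is only an illustration under stated assumptions: the four-document toy data frame is made up to stand in for base_corpus (which the question does not show), and the code assumes quanteda version 3 or later, where dfm() is built from tokens() and grouping is done with dfm_group() rather than with a groups argument to dfm(). Older versions may need the groups = "control" form mentioned above.

```r
library(quanteda)

# Hypothetical toy data standing in for base_corpus (not from the question)
toy <- data.frame(
  text = c("fiscal austerity and conditionality programme",
           "parliament debates fiscal rules",
           "lending conditionality for member countries",
           "parliament votes on the budget"),
  country = c("International Monetary Fund", "European Parliament",
              "International Monetary Fund", "European Parliament"),
  date = c("01-03-2010", "15-06-2011", "20-09-2011", "05-01-2012"),
  stringsAsFactors = FALSE
)
base_corpus <- corpus(toy, text_field = "text")

# Suggestion 2: set document variables through docvars(),
# not through base_corpus$documents$...
docvars(base_corpus, "int_date") <-
  as.Date(docvars(base_corpus, "date"), format = "%d-%m-%Y")

# Keep only the pre-crisis speeches, as in the question
cutoff <- as.Date("31-12-2012", format = "%d-%m-%Y")
control_corpus <- corpus_subset(base_corpus, int_date < cutoff)

# Suggestion 1: one docvar that labels each speech as IMF or EU
docvars(control_corpus, "control") <- ifelse(
  docvars(control_corpus, "country") == "International Monetary Fund",
  "IMF", "EU"
)

# Build a dfm, then collapse the counts to one row per value of "control"
control_dfm <- dfm(tokens(control_corpus, remove_punct = TRUE))
grouped_dfm <- dfm_group(control_dfm,
                         groups = docvars(control_dfm, "control"))

# tf-idf over the two grouped rows: terms that occur in both groups
# get an inverse document frequency of zero
tfidf <- as.matrix(dfm_tfidf(grouped_dfm))
imf_scores <- sort(tfidf["IMF", ], decreasing = TRUE)

# Terms with a positive score occur in the IMF speeches but not the EU ones
head(imf_scores[imf_scores > 0], 10)
```

Note that this collapses both sources to one row each, so a positive tf-idf score in the IMF row simply means the term never occurs in the EU speeches. If the EUP speeches should stay as separate documents, as the question describes, the grouping variable would need a distinct value per EUP speech instead.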