Is there any way to automatically create an indicator variable when combining quanteda corpora (using the plus operator) that can label which source corpus the document came from? For instance, say you have two corpora, corpus1 and corpus2. You run the following:
corpus3 <- corpus1 + corpus2
I'd like to find some way to create a new docvar that indicates which corpus each document in corpus3 comes from. Any ideas?
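There is no automatic way to do this at the moment, but the easiest method is to add a corpus identifier docvar to each corpus before combining them. The docvar is carried through the `+` operation, so every document in the combined corpus keeps the identifier of its source.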
library("quanteda")
# Loading required package: quanteda
# Package version: 1.3.4
c1 <- corpus(c(d11 = "C1 Doc one.", d12 = "C1 Doc two."))
c2 <- corpus(c(d21 = "C2 Doc one.", d22 = "C2 Doc two.", d23 = "C2 Doc 3"))
docvars(c1, "corpusid") <- 1
docvars(c2, "corpusid") <- 2
cc <- c1 + c2
summary(cc)
# Corpus consisting of 5 documents:
#
#  Text Types Tokens Sentences corpusid
#   d11     4      4         1        1
#   d12     4      4         1        1
#   d21     4      4         1        2
#   d22     4      4         1        2
#   d23     3      3         1        2
#
# Source: Combination of corpuses c1 and c2
# Created: Sun Jul 29 09:37:28 2018
# Notes:
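Once the identifier is in place, it can be used like any other docvar. The following is a minimal sketch, reusing the `c1`, `c2`, and `corpusid` names from the example above, that tallies documents by source and pulls the documents from one source back out. The object name `from_c2` is only illustrative, and the exact printed format may differ between quanteda versions.

```r
library("quanteda")

# Rebuild the two example corpora and tag each with its source identifier
c1 <- corpus(c(d11 = "C1 Doc one.", d12 = "C1 Doc two."))
c2 <- corpus(c(d21 = "C2 Doc one.", d22 = "C2 Doc two.", d23 = "C2 Doc 3"))
docvars(c1, "corpusid") <- 1
docvars(c2, "corpusid") <- 2
cc <- c1 + c2

# Count how many documents came from each source corpus
table(docvars(cc, "corpusid"))

# Recover only the documents that originated in c2
from_c2 <- corpus_subset(cc, corpusid == 2)
docnames(from_c2)
```

A character label (for example `"c1"` and `"c2"`) would work equally well as the identifier and may be easier to read in summaries than a number.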