
How to create a quanteda corpus from a data.frame with multiple columns for text?


Let's say I have the following:

x10 = data.frame(id = c(1,2,3),vars =c('top','down','top'), 
     text1=c('this is text','so is this','and this is too.'),
     text2=c('we have more text here','and here too','and look at this, more text.'))

I want to create a dfm/corpus in quanteda using the following:

x1 = corpus(x10,docid_field='id',text_field=c(3:4),tolower=T) 

Obviously this errors out because text_field only takes a single column. Is there a better way to handle this than building two corpora? Can I build two and then merge them on id? Is that a thing?


Solution

  • First, let's recreate your data.frame without converting the character values to factors:

    x10 = data.frame(id = c(1,2,3), vars = c('top','down','top'), 
                     text1 = c('this is text', 'so is this', 'and this is too.'),
                     text2 = c('we have more text here', 'and here too', 'and look at this, more text.'),
                     stringsAsFactors = FALSE)
    

    Then we have two options.

    Method 1: Reshape to "long" format and create a single corpus

    "Melt" the data first so that there is a single text column, and then import it as a corpus. (An alternative is tidyr::gather().)

    x10b <- reshape2::melt(x10, id.vars = c("id", "vars"), 
                           measure.vars = c("text1", "text2"),
                           variable.name = "doc_id", value.name = "text")
    
    # because corpus() takes document names from row names, by default 
    row.names(x10b) <- paste(x10b$doc_id, x10b$id, sep = "_")
    
    x10b
    #         id vars doc_id                         text
    # text1_1  1  top  text1                 this is text
    # text1_2  2 down  text1                   so is this
    # text1_3  3  top  text1             and this is too.
    # text2_1  1  top  text2       we have more text here
    # text2_2  2 down  text2                 and here too
    # text2_3  3  top  text2 and look at this, more text.
    
    x10_corpus <- corpus(x10b)
    summary(x10_corpus)
    # Corpus consisting of 6 documents:
    #     
    #    Text Types Tokens Sentences id vars doc_id
    # text1_1     3      3         1  1  top  text1
    # text1_2     3      3         1  2 down  text1
    # text1_3     5      5         1  3  top  text1
    # text2_1     5      5         1  1  top  text2
    # text2_2     3      3         1  2 down  text2
    # text2_3     8      8         1  3  top  text2
    # 
    # Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/lse-my459/assignment-2/* on x86_64 by kbenoit
    # Created: Tue Feb  6 19:06:07 2018
    # Notes:    
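
    The gather() alternative mentioned above could look roughly like the sketch below. It assumes the tidyr package is installed and that your quanteda version accepts the docid_field and text_field arguments used in the question. The column names "source" and "doc_id" and the object names x10c and x10c_corpus are chosen here for illustration. Instead of relying on row names, it builds an explicit id column and passes it to corpus().

    ```r
    library(tidyr)
    library(quanteda)

    x10 <- data.frame(id = c(1, 2, 3), vars = c("top", "down", "top"),
                      text1 = c("this is text", "so is this", "and this is too."),
                      text2 = c("we have more text here", "and here too",
                                "and look at this, more text."),
                      stringsAsFactors = FALSE)

    # stack text1 and text2 into a single "text" column; "source" (a name
    # chosen for this sketch) records which column each row came from
    x10c <- gather(x10, key = "source", value = "text", text1, text2)

    # build a unique document id such as "text1_1" for every row
    x10c$doc_id <- paste(x10c$source, x10c$id, sep = "_")

    # name the id and text columns explicitly rather than relying on defaults
    x10c_corpus <- corpus(x10c, docid_field = "doc_id", text_field = "text")
    ndoc(x10c_corpus)
    ```

    Naming the id column explicitly keeps the result independent of how a given quanteda version derives default document names.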
    

    Method 2: Make two corpus objects and combine

    Here, we create two corpus objects separately and combine them using the + operator.

    x10_corpus2 <- 
        corpus(x10[, -which(names(x10)=="text2")], text_field = "text1") +
        corpus(x10[, -which(names(x10)=="text1")], text_field = "text2")
    summary(x10_corpus2)
    # Corpus consisting of 6 documents:
    #     
    #   Text Types Tokens Sentences id vars
    #  text1     3      3         1  1  top
    #  text2     3      3         1  2 down
    #  text3     5      5         1  3  top
    # text11     5      5         1  1  top
    # text21     3      3         1  2 down
    # text31     8      8         1  3  top
    # 
    # Source:  Combination of corpuses corpus(x10[, -which(names(x10) == "text2")], text_field = "text1") and corpus(x10[, -which(names(x10) == "text1")], text_field = "text2")
    # Created: Tue Feb  6 19:14:14 2018
    # Notes: 
    

    At this stage you could also use docnames(x10_corpus2) <- to reassign the document names so that they resemble those from the first method.
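
    As a rough sketch of that renaming step, the variant below assigns the names to each corpus before combining them rather than afterwards. This is a small departure from the code above, made on the assumption that some quanteda versions may not accept combining corpora whose default document names clash. The object names c1, c2 and x10_corpus3 are made up for this example.

    ```r
    library(quanteda)

    x10 <- data.frame(id = c(1, 2, 3), vars = c("top", "down", "top"),
                      text1 = c("this is text", "so is this", "and this is too."),
                      text2 = c("we have more text here", "and here too",
                                "and look at this, more text."),
                      stringsAsFactors = FALSE)

    # one corpus per text column, dropping the other text column each time
    c1 <- corpus(x10[, names(x10) != "text2"], text_field = "text1")
    c2 <- corpus(x10[, names(x10) != "text1"], text_field = "text2")

    # name documents "<column>_<id>", mirroring the names used in Method 1
    docnames(c1) <- paste("text1", x10$id, sep = "_")
    docnames(c2) <- paste("text2", x10$id, sep = "_")

    # combine the two corpora; the names are now unique across both
    x10_corpus3 <- c1 + c2
    docnames(x10_corpus3)
    ```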