Quanteda - creating a corpus from a dataframe with multiple documents


First question here, so apologies for any faux pas. I have a dataframe in R of 657 observations with 4 variables. Each observation is a speech or interview by the Australian Prime Minister. The variables are:

  • date
  • title
  • URL
  • txt (full text of the speech/interview).

I'm trying to turn that into a corpus in Quanteda.

I first tried corp <- corpus(all_content), but that gave me an error message:

Error in corpus.data.frame(all_content) : 
  text_field column not found or invalid

This worked though: corp <- corpus(paste(all_content))

Then I ran summary(corp), which gave me:

Corpus consisting of 4 documents, showing 4 documents:

  Text Types  Tokens Sentences
 text1   243    1316         1
 text2  1095    6523         3
 text3   661    2630         1
 text4 25243 1867648     62572

My understanding is that this has effectively turned each column into a document, rather than each row?

If it matters, the txt variable is saved as a list. The code used to create each row is:

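```{r new_function}
# Packages the scraper relies on
library(rvest)      # read_html(), html_nodes(), html_text(); re-exports %>%
library(lubridate)  # dmy()
library(tibble)     # tibble()

scrape_speech <- function(url) {
  speech_page <- read_html(url)

  date <- speech_page %>% html_nodes(".date-display-single") %>% html_text() %>% dmy()
  title <- speech_page %>% html_nodes(".pagetitle") %>% html_text()
  # list() wraps all paragraphs of one speech into a single list element
  txt <- speech_page %>% html_nodes("#block-system-main p") %>% html_text() %>% list()

  tibble(date = date, title = title, URL = url, txt = txt)
}
```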

I then used the map_dfr function to go through and scrape the 657 separate URLs.

Someone has suggested that the cause is txt being saved as a list. I tried removing list() from the function, which gives me 21,904 observations, because each paragraph of a speech becomes a separate observation. I can turn that into a corpus with corp <- corpus(paste(all_content_not_list)) (once again, without the paste I get the same error as above). That similarly gives me 4 documents in the corpus! summary(corp) gives me:

Corpus consisting of 4 documents, showing 4 documents:

  Text Types  Tokens Sentences
 text1   243   43810         1
 text2  1092  214970        25
 text3   657   87618         1
 text4 25243 1865687     62626

Thanks in advance, Daniel


Solution

  • It's hard to address this problem exactly, because there is no reproducible example of your data.frame object, but if the structure contains the variables you list, then this should do it:

    corpus(all_content, text_field = "txt")
    

    See ?corpus.data.frame for details. If that does not do it, then try adding the output to your question of

    str(all_content)
    

    so that we can see in more detail what is in your all_content object.
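
    As for why the paste() version produced 4 documents: paste() coerces its argument with as.character(), and for a data.frame that yields one deparsed string per column. A minimal base-R sketch of that effect, using a made-up toy data.frame (the name toy and its contents are illustrative assumptions, not your data):

    ```r
    # Hypothetical stand-in for all_content: 3 rows (speeches), 2 columns
    toy <- data.frame(
      title = c("Speech A", "Speech B", "Speech C"),
      txt   = c("First text.", "Second text.", "Third text."),
      stringsAsFactors = FALSE
    )

    # paste() converts the data.frame with as.character(), which squashes
    # each whole column into a single string
    pasted <- paste(toy)

    # The result has one element per column, not one per row
    length(pasted) == ncol(toy)
    ```

    So corpus(paste(all_content)) most likely received a character vector with one element per column, which would explain the 4 "documents" in your summary.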

    Edited following OP's addition of new data:

    OK, so txt in your tibble is a list of character elements. You need to combine these into a single character string in order to use this as an input into corpus.data.frame(). Here's how:

    library("quanteda")
    ## Package version: 3.0.0
    ## Unicode version: 10.0
    ## ICU version: 61.1
    ## Parallel computing: 12 of 12 threads used.
    ## See https://quanteda.io for tutorials and examples.
    
    dframe <- structure(list(
      date = structure(18620, class = "Date"),
      title = " Prime Minister's Christmas Message to the ADF",
      URL = "https://www.pm.gov.au/media/prime-ministers-christmas-message-adf",
      txt = list(c(
        "G'day and Merry Christmas to everyone in our Australian Defence Force.",
        "You know, throughout our history, successive Australian governments... And this year was no different.",
        "God bless."
      ))
    ),
    row.names = c(NA, -1L),
    class = c("tbl_df", "tbl", "data.frame")
    )
    
    dframe$txt <- vapply(dframe$txt, paste, character(1), collapse = " ")
    
    corp <- corpus(dframe, text_field = "txt")
    print(corp, max_nchar = -1)
    ## Corpus consisting of 1 document and 3 docvars.
    ## text1 :
    ## "G'day and Merry Christmas to everyone in our Australian Defence Force. You know, throughout our history, successive Australian governments... And this year was no different. God bless."
    

    Created on 2021-04-08 by the reprex package (v1.0.0)
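
    The same collapse step should carry over to a data frame with many rows. A small base-R sketch with a made-up two-row object (speeches and its contents are hypothetical) that checks the result is one string per row before it would be handed to corpus():

    ```r
    # Hypothetical two-speech stand-in for all_content, with txt as a list column
    speeches <- data.frame(title = c("Speech A", "Speech B"),
                           stringsAsFactors = FALSE)
    speeches$txt <- list(
      c("Para one.", "Para two."),
      c("Only paragraph.")
    )

    # Collapse each row's paragraphs into a single string
    speeches$txt <- vapply(speeches$txt, paste, character(1), collapse = " ")

    # txt is now a plain character column with one entry per row
    is.character(speeches$txt)
    length(speeches$txt) == nrow(speeches)
    ```

    With txt in this shape, corpus(speeches, text_field = "txt") is expected to give one document per row, as in the single-row example above.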