
Parsing Speech Transcripts Using R


I have several large speech transcripts that I am trying to get into a data frame in which each row is one speech/utterance and a column holds the corresponding speaker's name.

Here is a snapshot of the text as it is currently structured:

"sr. presidente domínguez.- "
" Tiene la palabra el señor diputado por Buenos Aires."
""
"sr. ATANASOF, ALFREDO NESTOR.- "
" Señor presidente: también quiero adherir en nombre del Frente Peronista a este homenaje al Gringo Soria. "
"   Me tocó compartir con él  muchos años de trabajo en esta Cámara y luego funciones en el Poder Ejecutivo nacional. Realmente, durante esos años pude descubrir los valores del Gringo: un gran militante y peronista, quien siempre anteponía la resolución de los temas a las situaciones de conflicto."
"   Hemos sentido mucho dolor cuando nos enteramos de esta desgraciada situación. Por ello, en nombre de nuestro bloque, quiero adherir al homenaje que hacemos a un amigo. Justamente, el Gringo Soria era un amigo para mí. (Aplausos.)"
""

I used the following loop to try to parse the text so that each line holds a speaker and the corresponding speech/utterance:

test <- readtext(text)
testtxt <- test$text

trans.prep <- function(testtxt) {

  # gets rid of double spaces and replaces them with single spaces
  testtxt <- gsub("\\s{2,}", " ", testtxt, perl = TRUE)

  # gets rid of lines that are empty (length of line is zero)
  testtxt <- subset(testtxt, nchar(testtxt) > 0)

  # collapse down to utterances
  my.line <- 1

  while (my.line <= length(testtxt)) {
    utterance <- length(grep(".-", testtxt[my.line], perl = TRUE))
    if (utterance == 1) {
      my.line <- my.line + 1
    }
    if (utterance == 0) {
      testtext[my.line - 1] <- paste(testtext[(my.line - 1):my.line], collapse = " ")
      testtext <- testtext[-my.line]
    }
  }
  testtxt <- subset(testtxt, nchar(testtxt) > 0)

  return(testtxt)
}

The loop should return the parsed transcript, but when I run it nothing happens and R gives no error message.

I'm new to parsing and still a novice with R so I'm certain that is part of my problem. Any advice would be greatly appreciated.


Solution

  • It's hard to know exactly what your input format is, since the example is not fully reproducible, but let's assume that the text printed in the question consists of lines from a single text file. I saved it (without the double quotes) as such a text file, example.txt.

    We designed corpus_segment() for this use case.

    library("quanteda")
    ## Package version: 1.3.14
    
    example_corpus <- readtext::readtext("example.txt") %>%
      corpus()
    summary(example_corpus)
    ## Corpus consisting of 1 document:
    ## 
    ##         Text Types Tokens Sentences
    ##  example.txt    93    141         8
    ## 
    ## Source: /private/var/folders/1v/ps2x_tvd0yg0lypdlshg_vwc0000gp/T/RtmpXk3YHc/reprex1325b73a1073d/* on x86_64 by kbenoit
    ## Created: Wed Jan  9 19:09:55 2019
    ## Notes:
    
    example_corpus2 <-
      corpus_segment(example_corpus, pattern = "sr\\..*-", valuetype = "regex")
    summary(example_corpus2)
    ## Corpus consisting of 2 documents:
    ## 
    ##           Text Types Tokens Sentences                        pattern
    ##  example.txt.1    10     10         1     sr. presidente domínguez.-
    ##  example.txt.2    80    117         7 sr. ATANASOF, ALFREDO NESTOR.-
    ## 
    ## Source: /private/var/folders/1v/ps2x_tvd0yg0lypdlshg_vwc0000gp/T/RtmpXk3YHc/reprex1325b73a1073d/* on x86_64 by kbenoit
    ## Created: Wed Jan  9 19:09:55 2019
    ## Notes: corpus_segment.corpus(example_corpus, pattern = "sr\\..*-", valuetype = "regex")
    

    We can tidy that up a bit.

    # clean up pattern by removing unneeded elements
    docvars(example_corpus2, "pattern") <-
      stringi::stri_replace_all_fixed(docvars(example_corpus2, "pattern"),
        c("sr. ", ".-"), "",
        vectorize_all = FALSE
      )
    
    names(docvars(example_corpus2))[1] <- "speaker"
    
    summary(example_corpus2)
    ## Corpus consisting of 2 documents:
    ## 
    ##           Text Types Tokens Sentences                  speaker
    ##  example.txt.1    10     10         1     presidente domínguez
    ##  example.txt.2    80    117         7 ATANASOF, ALFREDO NESTOR
    ## 
    ## Source: /private/var/folders/1v/ps2x_tvd0yg0lypdlshg_vwc0000gp/T/RtmpXk3YHc/reprex1325b73a1073d/* on x86_64 by kbenoit
    ## Created: Wed Jan  9 19:09:55 2019
    ## Notes: corpus_segment.corpus(example_corpus, pattern = "sr\\..*-", valuetype = "regex")
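
If you would rather stay in base R, or want to see where the original loop may have gone wrong, here is a minimal sketch under some assumptions. It is not part of quanteda: parse_transcript() and its speaker_pattern argument are hypothetical names made up for this illustration. It assumes the transcript is available as a character vector with one element per line, and that every speaker tag sits on its own line, starts with "sr." and ends with ".-". Two things in the question's code are worth checking against this: readtext() returns the whole file as a single string rather than one element per line, and in the regex ".-" the unescaped dot matches any character, so any hyphen preceded by a character would count as a speaker tag.

```r
# Hypothetical helper (not from quanteda or readtext): turn transcript lines
# into a data frame with one row per utterance.
parse_transcript <- function(lines, speaker_pattern = "^sr\\..*\\.-\\s*$") {
  # Squeeze runs of whitespace, trim the ends, and drop empty lines.
  lines <- trimws(gsub("\\s{2,}", " ", lines))
  lines <- lines[nzchar(lines)]

  # A line is a speaker tag if it matches the (assumed) tag pattern.
  is_speaker <- grepl(speaker_pattern, lines, ignore.case = TRUE)

  # Each speaker tag starts a new utterance; the running count is its id.
  utterance_id <- cumsum(is_speaker)

  # Paste together the speech lines that belong to the same utterance.
  # Text before the first tag (id 0) is ignored, and a tag with no
  # speech lines after it produces no row.
  keep <- !is_speaker & utterance_id > 0
  speech <- tapply(lines[keep], utterance_id[keep], paste, collapse = " ")

  # Strip the leading "sr." and the trailing ".-" from the tags.
  speakers <- sub("^sr\\.\\s*", "", lines[is_speaker], ignore.case = TRUE)
  speakers <- sub("\\.-\\s*$", "", speakers)

  data.frame(
    speaker = speakers[as.integer(names(speech))],
    text = as.vector(speech),
    stringsAsFactors = FALSE
  )
}

# A few lines taken from the question, shortened for illustration.
lines <- c(
  "sr. presidente domínguez.- ",
  " Tiene la palabra el señor diputado por Buenos Aires.",
  "",
  "sr. ATANASOF, ALFREDO NESTOR.- ",
  " Señor presidente: también quiero adherir en nombre del Frente Peronista.",
  "   Me tocó compartir con él  muchos años de trabajo en esta Cámara."
)

speeches <- parse_transcript(lines)
print(speeches$speaker)
```

For a real file you could pass readLines("example.txt", encoding = "UTF-8") instead of the inline vector. Grouping with cumsum() avoids modifying the vector while looping over it, which is the part of the original loop that is easiest to get wrong.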