I have several large transcripts of speeches that I am trying to get into a data frame format where each row represents a speech/utterance and the corresponding speaker name is in a column.
Here is a snapshot of the text as it is currently structured:
"sr. presidente domínguez.- "
" Tiene la palabra el señor diputado por Buenos Aires."
""
"sr. ATANASOF, ALFREDO NESTOR.- "
" Señor presidente: también quiero adherir en nombre del Frente Peronista a este homenaje al Gringo Soria. "
" Me tocó compartir con él muchos años de trabajo en esta Cámara y luego funciones en el Poder Ejecutivo nacional. Realmente, durante esos años pude descubrir los valores del Gringo: un gran militante y peronista, quien siempre anteponía la resolución de los temas a las situaciones de conflicto."
" Hemos sentido mucho dolor cuando nos enteramos de esta desgraciada situación. Por ello, en nombre de nuestro bloque, quiero adherir al homenaje que hacemos a un amigo. Justamente, el Gringo Soria era un amigo para mí. (Aplausos.)"
""
I have used the following loop to try to parse the text so that each line represents a speaker and the corresponding speech/utterance:
test <- readtext(text)
testtxt <- test$text
trans.prep <- function(testtxt) {
  testtxt <- gsub("\\s{2,}", " ", testtxt, perl = T)
  #gets rid of double spaces and replaces them with single spaces
  testtxt <- subset(testtxt, nchar(testtxt) > 0)
  #gets rid of lines that are empty (length of line is zero)
  #collapse down to utterances
  my.line <- 1
  while (my.line <= length (testtxt)) {
    utterance <- length(grep(".-", testtxt[my.line], perl = T))
    if (utterance == 1) {my.line <- my.line + 1 }
    if (utterance == 0) {
      testtext[my.line-1] <-paste(testtext[(my.line-1):my.line], collapse = " ")
      testtext <- testtext[-my.line]
    }
  }
  testtxt <- subset(testtxt, nchar(testtxt) > 0)
  return(testtxt)
}
The loop should give back the parsed transcript but when I run the loop nothing happens and R provides no error message.
I'm new to parsing and still a novice with R so I'm certain that is part of my problem. Any advice would be greatly appreciated.
It's hard to know exactly what your input format is, since the example is not fully reproducible, but let's assume that the text as printed in the question consists of lines from a single text file. Here, I saved it (without the double quotes) as such a text file, example.txt.
We designed corpus_segment() for this use case.
library("quanteda")
## Package version: 1.3.14
example_corpus <- readtext::readtext("example.txt") %>%
  corpus()
summary(example_corpus)
## Corpus consisting of 1 document:
##
##         Text Types Tokens Sentences
##  example.txt    93    141         8
##
## Source: /private/var/folders/1v/ps2x_tvd0yg0lypdlshg_vwc0000gp/T/RtmpXk3YHc/reprex1325b73a1073d/* on x86_64 by kbenoit
## Created: Wed Jan 9 19:09:55 2019
## Notes:
example_corpus2 <-
  corpus_segment(example_corpus, pattern = "sr\\..*-", valuetype = "regex")
summary(example_corpus2)
## Corpus consisting of 2 documents:
##
##           Text Types Tokens Sentences                        pattern
##  example.txt.1    10     10         1     sr. presidente domínguez.-
##  example.txt.2    80    117         7 sr. ATANASOF, ALFREDO NESTOR.-
##
## Source: /private/var/folders/1v/ps2x_tvd0yg0lypdlshg_vwc0000gp/T/RtmpXk3YHc/reprex1325b73a1073d/* on x86_64 by kbenoit
## Created: Wed Jan 9 19:09:55 2019
## Notes: corpus_segment.corpus(example_corpus, pattern = "sr\\..*-", valuetype = "regex")
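As a quick check of what that pattern treats as a speaker tag, the same regular expression can be tried on a few lines with base R. This is only a sketch: `sample_lines` is a hypothetical vector copied from the question, and base R's regex engine is not the one quanteda relies on, although the two should agree for a pattern this simple. It also illustrates why the unescaped ".-" used in the question is too loose: the dot matches any character, so any hyphenated word counts as a hit.

```r
# Hypothetical sample: three lines copied from the transcript in the question
sample_lines <- c(
  "sr. presidente domínguez.- ",
  " Tiene la palabra el señor diputado por Buenos Aires.",
  "sr. ATANASOF, ALFREDO NESTOR.- "
)

# Same pattern as passed to corpus_segment(): "sr.", anything, then a hyphen
pattern <- "sr\\..*-"

# Which lines contain a speaker tag, and what exactly is matched
is_tag <- grepl(pattern, sample_lines)
tags <- regmatches(sample_lines, regexpr(pattern, sample_lines))

# The unescaped dot matches any character, so a hyphenated word is a hit;
# escaping the dot requires a literal ".-"
loose_hit <- grepl(".-", "un ex-diputado")
strict_hit <- grepl("\\.-", "un ex-diputado")

print(is_tag)
print(tags)
print(c(loose = loose_hit, strict = strict_hit))
```

Because `.*` is greedy, the match runs to the last hyphen on the tag line, which is harmless here but worth checking if speaker names themselves can contain hyphens.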
We can tidy that up a bit.
# clean up pattern by removing unneeded elements
docvars(example_corpus2, "pattern") <-
  stringi::stri_replace_all_fixed(docvars(example_corpus2, "pattern"),
    c("sr. ", ".-"), "",
    vectorize_all = FALSE
  )
names(docvars(example_corpus2))[1] <- "speaker"
summary(example_corpus2)
## Corpus consisting of 2 documents:
##
##           Text Types Tokens Sentences                  speaker
##  example.txt.1    10     10         1     presidente domínguez
##  example.txt.2    80    117         7 ATANASOF, ALFREDO NESTOR
##
## Source: /private/var/folders/1v/ps2x_tvd0yg0lypdlshg_vwc0000gp/T/RtmpXk3YHc/reprex1325b73a1073d/* on x86_64 by kbenoit
## Created: Wed Jan 9 19:09:55 2019
## Notes: corpus_segment.corpus(example_corpus, pattern = "sr\\..*-", valuetype = "regex")
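If you would rather stay in base R and go straight to a data frame with one row per utterance, something along these lines may work. As posted, the code in the question only defines trans.prep() and never calls it, which would explain why nothing seems to happen; it also mixes the names testtxt and testtext. The sketch below avoids the loop entirely. It is not part of quanteda: `parse_transcript()` and `example_lines` are hypothetical names, the speech lines are shortened from the question, and it assumes every speaker tag sits on its own line in the form "sr. NAME.-".

```r
# Hypothetical helper (not part of quanteda): turn transcript lines into a
# data frame with one row per speaker turn.
parse_transcript <- function(lines) {
  # squeeze runs of whitespace, trim both ends, and drop empty lines
  lines <- trimws(gsub("[[:space:]]{2,}", " ", lines))
  lines <- lines[nzchar(lines)]

  # assumption: a speaker tag is a whole line of the form "sr. NAME.-"
  is_tag <- grepl("^sr\\..*\\.-$", lines)

  # each line belongs to the most recent tag seen so far
  turn <- cumsum(is_tag)
  is_speech <- turn > 0 & !is_tag

  # strip the leading "sr. " and the trailing ".-" from the tags
  speakers <- sub("^sr\\.[[:space:]]*", "", sub("\\.-$", "", lines[is_tag]))

  # paste together all speech lines of a turn; turns without speech are dropped
  speech <- tapply(lines[is_speech], turn[is_speech], paste, collapse = " ")

  data.frame(
    speaker = speakers[as.integer(names(speech))],
    text = as.vector(speech),
    stringsAsFactors = FALSE
  )
}

# Shortened lines from the question, including the empty separator lines
example_lines <- c(
  "sr. presidente domínguez.- ",
  " Tiene la palabra el señor diputado por Buenos Aires.",
  "",
  "sr. ATANASOF, ALFREDO NESTOR.- ",
  " Señor presidente: también quiero adherir en nombre del Frente Peronista a este homenaje al Gringo Soria. ",
  " Justamente, el Gringo Soria era un amigo para mí. (Aplausos.)",
  ""
)

speeches <- parse_transcript(example_lines)
print(speeches$speaker)
```

For a real file, the lines could come from readLines("example.txt"); whether the tag pattern needs loosening depends on how consistent the transcripts are.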