I want to split a text by speaker (interlocutor).
The original text has this form:
this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us. TERCER PERSONA QUE SE LLAMA PEDRO: soy de acuerdo. CUARTA PERSONA (JOHN): Hi. How are you
I'm looking for a final result like this:
first column: FIRST PERSON | SECOND PERSON | TERCER PERSONA QUE SE LLAMA PEDRO | CUARTA PERSONA (JOHN)
second column: hi all, thank you for coming | thank you for inviting us | soy de acuerdo | Hi. How are you
The final result can also be in another format or reshaped.
The pattern to split on is one or more upper-case words followed by a ":", but one difficulty is that the upper-case part can contain optional characters such as ():,;
The actual text I want to split is this one: https://lopezobrador.org.mx/2021/01/14/version-estenografica-de-la-conferencia-de-prensa-matutina-del-presidente-andres-manuel-lopez-obrador-458/
I have tried different things using stringr, rebus and qdap. First I tried this pattern:
pattern_mayusc <- UPPER %R% one_or_more(UPPER) %R% optional(") ") %R% ":"
Next, I tried to extract a vector with the name of each interlocutor, to use the names as the pattern:
mayuscula<-sapply(str_extract_all(text, ".([A-Z]+:)"), paste, collapse= ' ')
I am close to what I want but cannot quite get there. Can anyone help? Thanks a lot in advance.
You may use strsplit on a pattern that matches either a ":" preceded by a sequence of words made of upper-case letters (\p{Lu}), spaces (\s) and optional parentheses (add more characters if you need them), or (|) a space that is not preceded by a word character and is followed by at least two upper-case letters, i.e. the start of such a sequence. We take the first element of the resulting list, clean it with trimws, and drop the text that precedes the first speaker. The result alternates between speaker and text, which we can easily convert into a two-column matrix filled by row.
pat <- r"{(?>\p{Lu}+?\s?)+\(?\p{Lu}+\)?\K(:)|(?<!\w)(\s)(?=\p{Lu}{2,})}"
# pat <- "(?>\\p{Lu}+?\\s?)+\\(?\\p{Lu}+\\)?\\K(:)|(?<!\\w)(\\s)(?=\\p{Lu}{2,})" ## for R < 4.0.0
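# [[1]] takes the pieces for the single input string; [-1] drops the text before the first speaker
tmp <- trimws(strsplit(x, pat, perl=TRUE)[[1]])[-1]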
res <- matrix(tmp, ncol=2, byrow=TRUE)
res
#      [,1]                                 [,2]
# [1,] "FIRST PERSON"                       "hi all, thank you for coming."
# [2,] "SECOND PERSON"                      "thank you for inviting us."
# [3,] "TERCER PERSONA QUE SE LLAMA ANDRÉS" "soy de: acuerdo."
# [4,] "CUARTA PERSONA (JOHN)"              "Hi. How are you?"
# [5,] "ANDRÉS"                             "Hola buenos días!"
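Since the question asks for columns, the matrix can also be turned into a data frame. The following is only a minimal sketch: the helper name `split_speakers` and the column names `speaker` and `text` are made up for illustration, the sample string is a shortened ASCII version of the data, and the pattern is the one above written with escaped backslashes. It assumes the text starts with some introduction before the first speaker, because the first piece is dropped.

```r
# Hypothetical helper (not part of base R): one row per speaker turn.
split_speakers <- function(x) {
  # Same pattern as above, in the escaped form that also works on R < 4.0.0.
  pat <- "(?>\\p{Lu}+?\\s?)+\\(?\\p{Lu}+\\)?\\K(:)|(?<!\\w)(\\s)(?=\\p{Lu}{2,})"
  # Split, trim whitespace, and drop the text before the first speaker.
  parts <- trimws(strsplit(x, pat, perl = TRUE)[[1]])[-1]
  # Speaker labels and utterances must alternate, so the count has to be even.
  stopifnot(length(parts) %% 2 == 0)
  m <- matrix(parts, ncol = 2, byrow = TRUE)
  data.frame(speaker = m[, 1], text = m[, 2], stringsAsFactors = FALSE)
}

# Shortened sample, assumed for illustration only.
sample_text <- "intro text. FIRST PERSON: hi all. SECOND PERSON: thank you. CUARTA PERSONA (JOHN): Hi. How are you?"
df <- split_speakers(sample_text)
print(df)
```

The `stopifnot` check is there because a stray match would shift every later row by one; failing early is easier to debug than a silently misaligned table.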
Data:
x <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us. TERCER PERSONA QUE SE LLAMA ANDRÉS: soy de: acuerdo. CUARTA PERSONA (JOHN): Hi. How are you? ANDRÉS: Hola buenos días!"
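If the speaker labels can also contain commas or semicolons, as the question mentions, a different technique from the strsplit pattern above may be easier to adapt: locate the labels with gregexpr and take the text between them with regmatches(..., invert = TRUE). This is only a sketch under assumptions: the label pattern `label_pat` and the sample string are made up for illustration, and the pattern would also treat any capitalised run directly followed by ":" (even a single letter such as "A:") as a speaker.

```r
# Assumed sample with a comma inside one label and a semicolon inside one utterance.
sample_text <- "intro. FIRST PERSON: hi all. PERSON TWO, JOHN: thanks; bye. THIRD (ANA): ok"

# A label starts with an upper-case letter, continues with upper-case letters,
# whitespace, parentheses, commas or semicolons, and ends with a colon.
label_pat <- "\\p{Lu}[\\p{Lu}\\s(),;]*:"

m <- gregexpr(label_pat, sample_text, perl = TRUE)

# The matches are the labels; strip the trailing colon.
speakers <- sub(":$", "", regmatches(sample_text, m)[[1]])

# invert = TRUE returns the text between matches; the first piece is the intro.
texts <- trimws(regmatches(sample_text, m, invert = TRUE)[[1]][-1])

res_df <- data.frame(speaker = speakers, text = texts, stringsAsFactors = FALSE)
print(res_df)
```

Because the labels and the utterances come from the same set of matches, the two vectors always have the same length, so no even-length check is needed here.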