r, regex, string, split, text-mining

How to split a text / string with pattern of upper case letters?


I’m looking to split a text by interlocutor (speaker).

The original text has this form:

this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us. TERCER PERSONA QUE SE LLAMA PEDRO: soy de acuerdo. CUARTA PERSONA (JOHN): Hi. How are you

I would like a final result like this:

first column: FIRST PERSON | SECOND PERSON | TERCER PERSONA QUE SE LLAMA PEDRO | CUARTA PERSONA (JOHN)

second column: hi all, thank you for coming | thank you for inviting us | soy de acuerdo | Hi. How are you

The final result can also be in another format or shape.

The pattern to split on is one or more upper-case words followed by a ":". One difficulty is that the upper-case part can contain optional characters such as ():,;

In fact the original text that I am searching to split is this one: https://lopezobrador.org.mx/2021/01/14/version-estenografica-de-la-conferencia-de-prensa-matutina-del-presidente-andres-manuel-lopez-obrador-458/

I have tried different things using stringr, rebus and qdap. First I tried this pattern:

pattern_mayusc <- UPPER %R% one_or_more(UPPER) %R% optional(") ") %R% ":"

Next, I tried to extract a vector with the name of each interlocutor, to use those names as the pattern afterwards:

mayuscula<-sapply(str_extract_all(text, ".([A-Z]+:)"), paste, collapse= ' ')

I am close to what I want but cannot quite get there. Can anyone help? Thanks a lot in advance.


Solution

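  • You may use strsplit with a pattern that matches either a : preceded by a sequence of upper-case words (upper-case letters \p{Lu}, whitespace \s and optional parentheses; add more characters if you need them), or (|) a whitespace character that does not follow a word character and is followed by at least two upper-case letters, i.e. the start of such a sequence. Because \K resets the start of the match, only the : itself is removed and the speaker label is kept. A colon after lower-case text, like the one in "soy de: acuerdo." in the data below, is not a split point. strsplit returns a list, so we take its first element with el, clean it with trimws, and drop the leading text before the first speaker with [-1]. The result alternates between speaker and text, which we can easily convert into a two-column matrix by row.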

    pat <- r"{(?>\p{Lu}+?\s?)+\(?\p{Lu}+\)?\K(:)|(?<!\w)(\s)(?=\p{Lu}{2,})}"
    # pat <- "(?>\\p{Lu}+?\\s?)+\\(?\\p{Lu}+\\)?\\K(:)|(?<!\\w)(\\s)(?=\\p{Lu}{2,})"  ## for R < 4.0.0
    
    tmp <- trimws(el(strsplit(x, pat, perl=TRUE)))[-1]
    res <- matrix(tmp, ncol=2, byrow=TRUE)
    res
    #      [,1]                                 [,2]                           
    # [1,] "FIRST PERSON"                       "hi all, thank you for coming."
    # [2,] "SECOND PERSON"                      "thank you for inviting us."   
    # [3,] "TERCER PERSONA QUE SE LLAMA ANDRÉS" "soy de: acuerdo."             
    # [4,] "CUARTA PERSONA (JOHN)"              "Hi. How are you?"             
    # [5,] "ANDRÉS"                             "Hola buenos días!"   
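
    To see which parts of the text the first alternative treats as speaker labels, it can help to run that half of the pattern on its own. The sketch below is an illustration, not part of the solution itself: it swaps `\K(:)` for a lookahead `(?=:)` so that the label is returned instead of the colon. The names `txt`, `label_pat` and `labels` are made up for the example, and it uses the ASCII-only text from the question.

```r
# Sample text from the question (ASCII only).
txt <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us. TERCER PERSONA QUE SE LLAMA PEDRO: soy de acuerdo. CUARTA PERSONA (JOHN): Hi. How are you"

# First alternative of the split pattern, with \K(:) replaced by a
# lookahead, so the match is the speaker label rather than the colon.
label_pat <- "(?>\\p{Lu}+?\\s?)+\\(?\\p{Lu}+\\)?(?=:)"

# gregexpr() locates every match; regmatches() extracts the matched text.
labels <- regmatches(txt, gregexpr(label_pat, txt, perl = TRUE))[[1]]
labels
```

    On this text it should return the four labels FIRST PERSON, SECOND PERSON, TERCER PERSONA QUE SE LLAMA PEDRO and CUARTA PERSONA (JOHN). Words such as "Hi" or "How" are skipped because they are not followed by a colon.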
    



    Data:

    x <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us. TERCER PERSONA QUE SE LLAMA ANDRÉS: soy de: acuerdo. CUARTA PERSONA (JOHN): Hi. How are you? ANDRÉS: Hola buenos días!"
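
    Since the question asks for a two-column result, the alternating vector can also be turned into a data frame with named columns. This is a minimal sketch under two assumptions: the text starts with some preamble before the first speaker (which `[-1]` drops; if the text began directly with a label, that label would be dropped instead), and every label is followed by some text. The column names `speaker` and `text` and the variable names are chosen here for illustration. It uses the question's text and `[[1]]` in place of `el()`.

```r
# Sample text from the question (ASCII only).
txt <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us. TERCER PERSONA QUE SE LLAMA PEDRO: soy de acuerdo. CUARTA PERSONA (JOHN): Hi. How are you"

# Same split pattern as in the solution, written with escaped backslashes.
pat <- "(?>\\p{Lu}+?\\s?)+\\(?\\p{Lu}+\\)?\\K(:)|(?<!\\w)(\\s)(?=\\p{Lu}{2,})"

# Split, trim whitespace, and drop the preamble before the first speaker.
parts <- trimws(strsplit(txt, pat, perl = TRUE)[[1]])[-1]

# Guard: labels and texts must alternate, so the length has to be even.
stopifnot(length(parts) %% 2 == 0)

# Odd positions hold the speaker labels, even positions the spoken text.
idx <- seq(1, length(parts), by = 2)
turns <- data.frame(speaker = parts[idx],
                    text    = parts[idx + 1],
                    stringsAsFactors = FALSE)
turns
```

    A data frame keeps one row per turn, which makes it straightforward to filter or group by `speaker` later on.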