r, regex, regex-lookarounds

How to extract conversational utterances from a single string


I have a conversation between several speakers recorded as a single string:

convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"

I also have a vector of the speakers' names:

speakers <- c("Peter", "Mary", "al hamshi")

Using this vector as a component of my regex pattern I'm doing relatively well with this extraction:

library(stringr)
str_extract_all(convers, 
                paste("(?<=: )[\\w\\s]+(?= ", paste0(".*\\b(", paste(speakers, collapse = "|"), ")\\b.*"), ")", sep = ""))
[[1]]
[1] "hiya"                                        "hi how wz your weekend"                      "ahh still got a headache An you party a lot"
[4] "nuh you know my kid s sick n stuff"          "yeah i know thats erm al"                    "hey guys how s it goin"                     
[7] "Great"                                       "where ve you been last week"

However, the first part of the third speaker's name (al) is contained in one of the extracted utterances (yeah i know thats erm al), and the last utterance by speaker al hamshi (ah you know camping with my girl friend) is missing from the output. How can the regex be improved so that all utterances get matched and extracted correctly?


Solution

  • A correct splitting approach would look like

    p2 <- paste0("\\s*\\b(?:", paste(speakers, collapse = "|"), ")(?=:)")
    strsplit(sub("^\\W+", "", gsub(p2, "", convers, perl=TRUE)), "\\s*:\\s*")[[1]]
    # => [1] "hiya"                                       
    # => [2] "hi how wz your weekend"                     
    # => [3] "ahh still got a headache An you party a lot"
    # => [4] "nuh you know my kid s sick n stuff"         
    # => [5] "yeah i know thats erm"                      
    # => [6] "hey guys how s it goin"                     
    # => [7] "Great"                                      
    # => [8] "where ve you been last week"                
    # => [9] "ah you know camping with my girl friend"    
    

    The regex to remove speakers from the string will look like

    \s*\b(?:Peter|Mary|al hamshi)(?=:)
    

    The pattern matches:

    • \s* - 0+ whitespaces
    • \b - a word boundary
    • (?:Peter|Mary|al hamshi) - one of the speaker names
    • (?=:) - that must be followed with a : char.

    Then, the non-word chars at the start are removed with the sub("^\\W+", "", ...) call, and then the whole string is split with \s*:\s* regex that matches a : enclosed with 0+ whitespaces.
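
    To make the three steps above easier to follow, here is a minimal sketch that prints the intermediate strings. The shortened convers value and the variable names no_names and trimmed are assumptions made for illustration; they are not part of the original answer.

    ```r
    speakers <- c("Peter", "Mary", "al hamshi")
    # Shortened sample conversation (assumed for illustration)
    convers <- "Peter: hiya Mary: hi al hamshi: hey guys"

    # Step 1: pattern for a speaker name that is followed by ":"
    p2 <- paste0("\\s*\\b(?:", paste(speakers, collapse = "|"), ")(?=:)")

    # Step 2: drop the names; the colons stay behind as separators
    no_names <- gsub(p2, "", convers, perl = TRUE)
    no_names
    # => [1] ": hiya: hi: hey guys"

    # Step 3: strip the leading ": " so the split yields no empty first element
    trimmed <- sub("^\\W+", "", no_names)
    trimmed
    # => [1] "hiya: hi: hey guys"

    # Step 4: split on the remaining colons
    strsplit(trimmed, "\\s*:\\s*")[[1]]
    # => [1] "hiya"     "hi"       "hey guys"
    ```

    Note that this approach splits on every remaining colon, so an utterance that itself contains a : would be cut into several pieces.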

    Alternatively, you can use

    (?<=(?:Peter|Mary|al hamshi):\s).*?(?=\s*(?:Peter|Mary|al hamshi):|\z)
    

    Details:

    • (?<=(?:Peter|Mary|al hamshi):\s) - a location immediately preceded with any speaker name, a : and a whitespace
    • .*? - any 0+ chars (other than line break chars, use (?s) at the pattern start to make it match any chars) as few as possible
    • (?=\s*(?:Peter|Mary|al hamshi):|\z) - a location immediately followed with 0+ whitespaces, then any speaker name and a : or end of string.
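
    The remark about (?s) can be checked with a small sketch using stringr. The sample string with a line break inside an utterance and the names alt, without_s and with_s are assumptions made for illustration.

    ```r
    library(stringr)

    speakers <- c("Peter", "Mary")
    # Assumed sample: Peter's utterance contains a line break
    convers <- "Peter: hiya\nhow are you Mary: fine"

    alt <- paste(speakers, collapse = "|")
    p <- paste0("(?<=(?:", alt, "):\\s).*?(?=\\s*(?:", alt, "):|\\z)")

    # Without (?s), "." stops at the line break, so Peter's utterance is not matched
    without_s <- str_extract_all(convers, p)[[1]]
    without_s
    # => [1] "fine"

    # With (?s), "." also matches the line break
    with_s <- str_extract_all(convers, paste0("(?s)", p))[[1]]
    with_s
    # => [1] "hiya\nhow are you" "fine"
    ```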

    In R, you can use

    library(stringr)
    speakers <- c("Peter", "Mary", "al hamshi")
    convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
    p <- paste0("(?<=(?:", paste(speakers, collapse = "|"), "):\\s).*?(?=\\s*(?:", paste(speakers, collapse = "|"), "):|\\z)")
    str_extract_all(convers, p)
    # => [[1]]
    # => [1] "hiya"                                       
    # => [2] "hi how wz your weekend"                     
    # => [3] "ahh still got a headache An you party a lot"
    # => [4] "nuh you know my kid s sick n stuff"         
    # => [5] "yeah i know thats erm"                      
    # => [6] "hey guys how s it goin"                     
    # => [7] "Great"                                      
    # => [8] "where ve you been last week"                
    # => [9] "ah you know camping with my girl friend"
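
    Because the speaker names are pasted straight into the pattern, a name containing regex metacharacters (parentheses, dots, and so on) would be read as regex syntax. The sketch below shows one possible way to guard against that. The helper escape_regex is hypothetical (it is not part of the original answer or of stringr), and the speaker names and conversation are assumed sample data.

    ```r
    library(stringr)

    # Hypothetical helper: put a backslash in front of regex metacharacters
    escape_regex <- function(x) {
      gsub("([.|()\\[\\]{}^$*+?\\\\])", "\\\\\\1", x, perl = TRUE)
    }

    # Assumed sample data: one name contains parentheses
    speakers <- c("Ann (host)", "Bob")
    convers <- "Ann (host): welcome Bob: thanks Ann (host): bye"

    # Build the alternation from the escaped names, then reuse the same pattern shape
    alt <- paste(escape_regex(speakers), collapse = "|")
    p <- paste0("(?<=(?:", alt, "):\\s).*?(?=\\s*(?:", alt, "):|\\z)")

    utterances <- str_extract_all(convers, p)[[1]]
    utterances
    # => [1] "welcome" "thanks"  "bye"
    ```

    Without the escaping step, the parentheses in "Ann (host)" would act as a group rather than literal characters, so that name would not match.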