I have a conversation between several speakers recorded as a single string:
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
I also have a vector of the speakers' names:
speakers <- c("Peter", "Mary", "al hamshi")
Using this vector as a component of my regex pattern I'm doing relatively well with this extraction:
library(stringr)
str_extract_all(convers,
paste("(?<=: )[\\w\\s]+(?= ", paste0(".*\\b(", paste(speakers, collapse = "|"), ")\\b.*"), ")", sep = ""))
[[1]]
[1] "hiya" "hi how wz your weekend" "ahh still got a headache An you party a lot"
[4] "nuh you know my kid s sick n stuff" "yeah i know thats erm al" "hey guys how s it goin"
[7] "Great" "where ve you been last week"
However, the first part of the third speaker's name (al) is contained in one of the extracted utterances (yeah i know thats erm al), and the last utterance by speaker al hamshi (ah you know camping with my girl friend) is missing from the output. How can the regex be improved so that all utterances get matched and extracted correctly?
A correct splitting approach would look like
p2 <- paste0("\\s*\\b(?:", paste(speakers, collapse = "|"), ")(?=:)")
strsplit(sub("^\\W+", "", gsub(p2, "", convers, perl=TRUE)), "\\s*:\\s*")[[1]]
# => [1] "hiya"
# => [2] "hi how wz your weekend"
# => [3] "ahh still got a headache An you party a lot"
# => [4] "nuh you know my kid s sick n stuff"
# => [5] "yeah i know thats erm"
# => [6] "hey guys how s it goin"
# => [7] "Great"
# => [8] "where ve you been last week"
# => [9] "ah you know camping with my girl friend"
The regex that removes the speaker names from the string is
\s*\b(?:Peter|Mary|al hamshi)(?=:)
It matches:
\s* - 0+ whitespaces
\b - a word boundary
(?:Peter|Mary|al hamshi) - one of the speaker names
(?=:) - that must be followed with a : char.
Then, the non-word chars at the start are removed with the sub("^\\W+", "", ...) call, and the whole string is split with the \s*:\s* regex, which matches a : enclosed with 0+ whitespaces.
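To see what each step does, here is a minimal sketch on a shortened, made-up conversation. The intermediate variable names (no_names, trimmed, utterances) are only illustrative. Note that this approach assumes the utterances contain no : themselves, since every colon left after removing the names is treated as a delimiter.

```r
speakers <- c("Peter", "Mary", "al hamshi")
# Shortened example input (an assumption for illustration, not the full conversation)
convers <- "Peter: hiya Mary: hi al hamshi: hey guys"

# Step 1: pattern for a speaker name (with any leading whitespace) right before a colon
p2 <- paste0("\\s*\\b(?:", paste(speakers, collapse = "|"), ")(?=:)")

# Step 2: remove the names; only the colons remain as delimiters
no_names <- gsub(p2, "", convers, perl = TRUE)
print(no_names)
# [1] ": hiya: hi: hey guys"

# Step 3: strip the non-word chars (": ") left at the start
trimmed <- sub("^\\W+", "", no_names)
print(trimmed)
# [1] "hiya: hi: hey guys"

# Step 4: split on a colon surrounded by optional whitespace
utterances <- strsplit(trimmed, "\\s*:\\s*")[[1]]
print(utterances)
# [1] "hiya"     "hi"       "hey guys"
```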
Alternatively, you can use
(?<=(?:Peter|Mary|al hamshi):\s).*?(?=\s*(?:Peter|Mary|al hamshi):|\z)
Details:
(?<=(?:Peter|Mary|al hamshi):\s) - a location immediately preceded with any speaker name, a : and a whitespace
.*? - any 0+ chars (other than line break chars; use (?s) at the pattern start to make it match any chars), as few as possible
(?=\s*(?:Peter|Mary|al hamshi):|\z) - a location immediately followed with 0+ whitespaces, then any speaker name and a :, or with the end of string.
In R, you can use
library(stringr)
speakers <- c("Peter", "Mary", "al hamshi")
convers <- "Peter: hiya Mary: hi how wz your weekend Peter: ahh still got a headache An you party a lot Mary: nuh you know my kid s sick n stuff Peter: yeah i know thats erm al hamshi: hey guys how s it goin Peter: Great Mary: where ve you been last week al hamshi: ah you know camping with my girl friend"
p <- paste0("(?<=(?:", paste(speakers, collapse = "|"), "):\\s).*?(?=\\s*(?:", paste(speakers, collapse = "|"), "):|\\z)")
str_extract_all(convers, p)
# => [[1]]
# => [1] "hiya"
# => [2] "hi how wz your weekend"
# => [3] "ahh still got a headache An you party a lot"
# => [4] "nuh you know my kid s sick n stuff"
# => [5] "yeah i know thats erm"
# => [6] "hey guys how s it goin"
# => [7] "Great"
# => [8] "where ve you been last week"
# => [9] "ah you know camping with my girl friend"
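If you also need to know who said what, one possible extension (not part of the two solutions above) is to extract the speaker names with the same alternation and pair them with the utterances. A sketch in base R on a shortened conversation; the names name_alt, who, what and turns are assumptions made for illustration:

```r
speakers <- c("Peter", "Mary", "al hamshi")
# Shortened example input (an assumption for illustration)
convers <- "Peter: hiya Mary: hi how wz your weekend al hamshi: hey guys how s it goin"

# Alternation of all speaker names, reused in both patterns
name_alt <- paste(speakers, collapse = "|")

# Speaker of each turn: a known name immediately followed by a colon
name_pat <- paste0("\\b(?:", name_alt, ")(?=:)")
who <- regmatches(convers, gregexpr(name_pat, convers, perl = TRUE))[[1]]

# Utterances, obtained with the splitting approach shown earlier
strip_pat <- paste0("\\s*\\b(?:", name_alt, ")(?=:)")
what <- strsplit(sub("^\\W+", "", gsub(strip_pat, "", convers, perl = TRUE)),
                 "\\s*:\\s*")[[1]]

# Both vectors must line up one-to-one before they are combined
stopifnot(length(who) == length(what))
turns <- data.frame(speaker = who, utterance = what, stringsAsFactors = FALSE)
print(turns)
```

The length check is there because an empty utterance or a stray colon inside an utterance would make the two vectors go out of step.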