I have the following text:
Master of the universe\n\n(Jul 26, 2023 - 1:00pm)\n\n(Interviewee: Marina)\n\n\n\n(00:00:05 - 00:00:09)\n\n\t Alice: This project. Uh my job is to ask lots of questions.\n\n\n\n(00:00:10 - 00:00:11)\n\n\t Marina: What is it?\n\n\n\n(00:00:11 - 00:00:14)\n\n\t Alice: Uh uh impartially.\n\n\n\n(00:00:15 - 00:00:18)\n\n\t Alice: Uh so suddenly I don't work for a particular brand.\n\n\n\n(00:00:19 - 00:00:21)\n\n\t Alice: Uh I'm self-employed,\n\n\n\n(00:00:21 - 00:00:21)\n\n\t Marina: M M.\n\n\n\n(00:00:21 - 00:00:32)\n\n\t Alice: I do group interviews with lots of brands, from toothpaste to the product we're going to talk about today.\n\n\n\n(00:00:32 - 00:00:32)\n\n\t Marina: Okay.\n\n\n\n(00:00:33 - 00:00:37)\n\n\t Alice: Uh today we're gonna talk for an hour uh.\n\n\n\n(00:00:36 - 00:00:36)\n\n\t Marina: Okay.\n\n\n\n(00:00:37 - 00:00:39)\n\n\t
From the above text, I want to extract the name: text pairs. For example:
Alice: This project. Uh my job is to ask lots of questions.
Marina: What is it?
Alice: Uh uh impartially.
Alice: Uh so suddenly I don't work for a particular brand.
Alice: Uh I'm self-employed,
Marina: M M.
Alice: I do group interviews with lots of brands, from toothpaste to the product we're going to talk about today.
Marina: Okay.
Alice: Uh today we're gonna talk for an hour uh.
Marina: Okay.
I can match the timestamps with this regex, but I cannot exclude them:
(?:[\\n]+\(\d{2}:\d{2}:\d{2} - \d{2}:\d{2}:\d{2}\)[\\n\\t\\s]+|$)
I need a regex pattern that excludes all the timestamps and other text, keeping only the name: text lines as shown above.
EDIT: I forgot to mention, exclude the lines that match the Interviewee name.
P.S.: I do not want Python code that does a regex replace using the above pattern. I just want a complete pattern that finds matches for name: text.
Test Code : https://ideone.com/egOlTP
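I would do something like this using Python's re module:
import re

# Group 1 is the speaker name, group 2 is the rest of that line.
# input_text holds the transcript from the question.
pattern = r'(\w+): (.+)'
matches = re.findall(pattern, input_text)
for name, text in matches:
    print(f"{name}: {text}")
This prints the name: text pairs you are looking for. Hope this helps.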
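As a quick check, here is a minimal sketch that runs this pattern on a shortened excerpt of the transcript. The variable `sample` is an assumption for illustration, and it assumes the text contains real newline and tab characters rather than literal backslash sequences. Note that the header line is matched as well:

```python
import re

# Assumed, shortened excerpt of the transcript from the question.
sample = (
    "Master of the universe\n\n(Jul 26, 2023 - 1:00pm)\n\n"
    "(Interviewee: Marina)\n\n\n\n(00:00:05 - 00:00:09)\n\n"
    "\t Alice: This project. Uh my job is to ask lots of questions.\n\n\n\n"
    "(00:00:10 - 00:00:11)\n\n\t Marina: What is it?\n\n\n\n"
)

# The timestamps never match because none of their colons is followed by a space.
matches = re.findall(r'(\w+): (.+)', sample)
for name, text in matches:
    print(f"{name}: {text}")
```

The first tuple returned is `('Interviewee', 'Marina)')`, followed by the two speaker lines.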
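If you do not want other matches in the output (such as the Interviewee line), replace the pattern
pattern = r'(\w+): (.+)'
with this one, which has a leading space
pattern = r' (\w+): (.+)'
This works because each speaker line is indented with a tab and a space before the name, whereas "Interviewee" is preceded by an opening parenthesis.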
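A slightly stricter variant, offered as a sketch rather than the only way to do it, anchors the match to the start of a line with `re.MULTILINE` and requires leading indentation, so it does not rely on exactly one space before the name. The names `sample` and `SPEAKER_LINE` are assumptions for illustration, and the excerpt again assumes real newline and tab characters:

```python
import re

# Assumed, shortened excerpt of the transcript from the question.
sample = (
    "Master of the universe\n\n(Jul 26, 2023 - 1:00pm)\n\n"
    "(Interviewee: Marina)\n\n\n\n(00:00:05 - 00:00:09)\n\n"
    "\t Alice: This project. Uh my job is to ask lots of questions.\n\n\n\n"
    "(00:00:10 - 00:00:11)\n\n\t Marina: What is it?\n\n\n\n"
)

# ^ and $ match at line boundaries under re.MULTILINE.
# [ \t]+ requires indentation, which skips "(Interviewee: ...)" and the timestamps.
SPEAKER_LINE = re.compile(r'^[ \t]+(\w+): (.+)$', re.MULTILINE)

lines = [f"{m.group(1)}: {m.group(2)}" for m in SPEAKER_LINE.finditer(sample)]
print("\n".join(lines))
```

Only the indented speaker lines are returned; the title, date, Interviewee header, and timestamps all start at column zero and are skipped.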