Tags: python, regex, ms-word, docx, regex-group

Regex pattern to exclude timestamps


I have the following text:

Master of the universe\n\n(Jul 26, 2023 - 1:00pm)\n\n(Interviewee: Marina)\n\n\n\n(00:00:05 - 00:00:09)\n\n\t Alice: This project. Uh my job is to ask lots of questions.\n\n\n\n(00:00:10 - 00:00:11)\n\n\t Marina: What is it?\n\n\n\n(00:00:11 - 00:00:14)\n\n\t Alice: Uh uh impartially.\n\n\n\n(00:00:15 - 00:00:18)\n\n\t Alice: Uh so suddenly I don't work for a particular brand.\n\n\n\n(00:00:19 - 00:00:21)\n\n\t Alice: Uh I'm self-employed,\n\n\n\n(00:00:21 - 00:00:21)\n\n\t Marina: M M.\n\n\n\n(00:00:21 - 00:00:32)\n\n\t Alice: I do group interviews with lots of brands, from toothpaste to the product we're going to talk about today.\n\n\n\n(00:00:32 - 00:00:32)\n\n\t Marina: Okay.\n\n\n\n(00:00:33 - 00:00:37)\n\n\t Alice: Uh today we're gonna talk for an hour uh.\n\n\n\n(00:00:36 - 00:00:36)\n\n\t Marina: Okay.\n\n\n\n(00:00:37 - 00:00:39)\n\n\t 

From the above text, I want to extract only the "name: text" lines. For example:

Alice: This project. Uh my job is to ask lots of questions.
Marina: What is it?
Alice: Uh uh impartially.
Alice: Uh so suddenly I don't work for a particular brand.
Alice: Uh I'm self-employed,
Marina: M M.
Alice: I do group interviews with lots of brands, from toothpaste to the product we're going to talk about today.
Marina: Okay.
Alice: Uh today we're gonna talk for an hour uh.
Marina: Okay.

I can match the timestamps with this regex, but I cannot exclude them:

(?:[\\n]+\(\d{2}:\d{2}:\d{2} - \d{2}:\d{2}:\d{2}\)[\\n\\t\\s]+|$)

I need a regex pattern that can exclude all the timestamps and other text, only keeping the name: text as shown above.

EDIT: I forgot to mention, exclude the lines that match the Interviewee name.

P.S.: I do not want Python code that does a regex replace using the above pattern. I just need a complete pattern that finds matches for name: text.


Solution

  • Test Code : https://ideone.com/egOlTP

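    I would do something like this using re:

    import re

    # input_text holds the transcript shown in the question
    pattern = r'(\w+): (.+)'
    
    matches = re.findall(pattern, input_text)
    
    for match in matches:
        # each match is a (name, text) tuple, one per capture group
        name, text = match
        print(f"{name}: {text}")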
    

    This prints the name: text lines you are looking for.


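    If you do not want other matches, such as the Interviewee line that the pattern above also picks up, replace the pattern

    pattern = r'(\w+): (.+)'
    

    with this one, which requires a space directly before the name:

    pattern = r' (\w+): (.+)'

    In (Interviewee: Marina) the name follows an opening parenthesis, so it no longer matches, while every speaker line has a space before the name.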
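
A slightly stricter variant of the same idea, offered as a sketch rather than a definitive pattern: anchor the match to the start of an indented line with re.MULTILINE, so only lines that begin with a tab or space followed by "name: " are kept. It assumes the transcript contains real newline and tab characters (not the literal two-character sequences \n and \t), and the names sample_text, speaker_line and utterances are made up for this illustration.

```python
import re

# Shortened copy of the transcript from the question.
# Assumption: "\n" and "\t" are real newline and tab characters.
sample_text = (
    "Master of the universe\n\n(Jul 26, 2023 - 1:00pm)\n\n"
    "(Interviewee: Marina)\n\n\n\n"
    "(00:00:05 - 00:00:09)\n\n"
    "\t Alice: This project. Uh my job is to ask lots of questions.\n\n\n\n"
    "(00:00:10 - 00:00:11)\n\n"
    "\t Marina: What is it?\n\n\n\n"
    "(00:00:37 - 00:00:39)\n\n\t "
)

# ^[ \t]+  the line must start with indentation (tab and/or spaces)
# (\w+):   the speaker name, followed by a colon and one space
# (.+)$    the rest of that line is the spoken text
speaker_line = re.compile(r"^[ \t]+(\w+): (.+)$", re.MULTILINE)

# findall returns one (name, text) tuple per matching line
utterances = [f"{name}: {text}" for name, text in speaker_line.findall(sample_text)]

for utterance in utterances:
    print(utterance)
```

Timestamp lines, the date line, and (Interviewee: Marina) all start with "(" rather than indentation, so they are skipped without any replace step. The trailing empty "\t " line is skipped too, because no name follows the indentation.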