python, regex, python-3.x, regex-greedy, non-greedy

Python regex non-greedy acting like greedy


I am working with transcripts and having trouble matching patterns in a non-greedy fashion. My pattern still grabs far too much and appears to be matching greedily.

A transcript looks like this:

>> John doe: Hello, I am John Doe.

>> Hello, I am Jane Doe.

>> Thank you for coming, we will start in two minutes.

>> Sam Smith: [no audio] Good morning, everyone.

To find the names of the speakers, which appear as >> (WHATEVER NAME):, I wrote

import re

pattern = re.compile(r'>>(.*?):')
transcript = '>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith: [no audio] Good morning, everyone.'
re.findall(pattern, transcript)

I expected 'John doe' and 'Sam Smith', but it gives me 'John doe' and 'Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith'

I am confused because .*? is non-greedy, which (I think) should be able to grab just 'Sam Smith'. How should I fix the code so that it only grabs whatever is inside >> (WHATEVER NAME):? Also, I am using Python 3.6.

Thanks!


Solution

  • Do you really need a regex? You can split on the >> markers, keep only the chunks that contain a colon, and take the text before the first colon as the speaker's name.

    >>> [i.split(':')[0].strip() for i in transcript.split('>>') if ':' in i]
    ['John doe', 'Sam Smith']