I'm writing a Python regex that searches a text document for quoted strings (quotes from airline pilots recorded by black boxes). I started by trying to write a regex with the following rules:
Return what is between quotes.
If it opens with a single quote, only return it if it closes with a single quote.
If it opens with a double quote, only return it if it closes with a double quote.
For instance, I don't want to match "hi there' or 'hi there", but I do want to match "hi there" and 'hi there'.
I use a testing page which contains things like:
CA "Runway 18, wind 230 degrees, five knots, altimeter 30."
AA "Roger that"
18:24:10 [flap lever moving into detent]
ST: "Some passenger's pushing a switch. May I?"
So I decided to start simple:
re.findall('("|\').*?\\1', page)
########## /("|').*?\1/ <-- raw regex I think I'm going for.
This regex acts very unexpectedly.
I thought it would:
Instead, it returns an array of quotes but never anything else.
['"', '"', "'", "'"]
I'm really confused because the equivalent (AFAIK) regex works just fine in Vim.
\("\|'\).\{-}\1/)
My question is this:
Why does it return only what is inside the parentheses as the match? Is this a flaw in my understanding of backreferences? If so, then why does it work in Vim?
And how do I write the regex I'm looking for in Python?
Thank you for your help!
Read the documentation: re.findall returns the groups, if there are any. If you want the entire match, you must group it all or use re.finditer. See this question.
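Here is a minimal sketch of that difference, using the sample lines from the question as the test page. The variable names (page, pattern, groups_only, full_matches, inner_text) are only for illustration, and the second capture group around .*? is an addition to the original pattern so the text between the quotes can be pulled out directly.

```python
import re

# Sample lines taken from the question's test page.
page = '''CA "Runway 18, wind 230 degrees, five knots, altimeter 30."
AA "Roger that"
18:24:10 [flap lever moving into detent]
ST: "Some passenger's pushing a switch. May I?"
'''

# With exactly one capture group, findall returns only that group's text,
# which here is just the opening quote character of each match.
groups_only = re.findall(r"""(['"]).*?\1""", page)

# Group 1 is the opening quote, group 2 (added for this sketch) is the
# text between the quotes, and \1 requires the same quote to close it.
pattern = re.compile(r"""(['"])(.*?)\1""")

# finditer yields match objects, so both the whole match (group 0)
# and the inner text (group 2) are available.
full_matches = [m.group(0) for m in pattern.finditer(page)]
inner_text = [m.group(2) for m in pattern.finditer(page)]

print(groups_only)
for text in inner_text:
    print(text)
```

On this page, groups_only contains only quote characters, while inner_text holds the three quoted sentences, including the one with the apostrophe in "passenger's". Calling findall with the two-group pattern would instead return a list of (quote, text) tuples.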