I'm trying to find the position (index) of unquoted words in a string, but all my tests have been unsuccessful.
For the string: string='foo "bar" baz'
I'd like to get
(0, 3) # This for foo
(10, 13) # This for baz
# I'd like to skip the quoted "bar"
However, every regular expression I try includes the quoted 'bar'
or parts of it:
import re
string = 'foo "bar" baz'
_RE_UNQUOTED_VALUES = re.compile(r"([^\"']\w+[^\"'])")
print([m.span() for m in _RE_UNQUOTED_VALUES.finditer(string)])
# Outputs [(0, 4), (5, 8), (9, 13)]
Or using:
_RE_UNQUOTED_VALUES = re.compile(r"(?!(\"|'))\w+(?!(\"|'))")
# Outputs [(0, 3), (5, 7), (10, 13)]
Is this not doable with regular expressions? Am I misunderstanding how finditer()
works?
You can use
import re
string="foo 'bar' baz"
ms = re.finditer(r"""\b(?<!['"])\w+\b(?!['"])""", string)
print([(x.start(), x.end()) for x in ms])
# => [(0, 3), (10, 13)]
See the Python demo.
The \b(?<!['"])\w+\b(?!['"]) regex matches a word boundary first; the (?<!['"]) negative lookbehind then fails the match if there is a ' or " char immediately on the left. Next, it matches one or more word chars and checks for a word boundary again, and the (?!['"]) negative lookahead fails the match if there is a ' or " char immediately on the right.
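Since the snippet above uses single quotes while the question's string uses double quotes, here is a small sketch that runs the same pattern on both. The helper name unquoted_word_spans is only illustrative and not part of the original answer.

```python
import re

# Same pattern as in the answer; the constant and function names are hypothetical.
_UNQUOTED_WORD = re.compile(r"""\b(?<!['"])\w+\b(?!['"])""")

def unquoted_word_spans(text):
    """Return (start, end) spans of words that do not touch a quote char."""
    return [m.span() for m in _UNQUOTED_WORD.finditer(text)]

print(unquoted_word_spans('foo "bar" baz'))  # [(0, 3), (10, 13)]
print(unquoted_word_spans("foo 'bar' baz"))  # [(0, 3), (10, 13)]
```

The two \b anchors are what stop the engine from backtracking into a partial match such as "ba", which is how the second attempt in the question ended up with (5, 7).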
See the regex demo.
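One caveat: the lookarounds only inspect the characters directly next to each word, so a word in the middle of a multi-word quoted phrase, like the b in 'foo "a b c" baz', would still be matched. If that case matters, a possible alternative (a sketch, not part of the answer above) is to match whole quoted strings first and capture only what is left. It assumes quotes are balanced and contain no escaped quotes; the names below are hypothetical.

```python
import re

# Alternatives are tried left to right: a double-quoted string, a
# single-quoted string, or a bare word captured in group 1.
_TOKEN = re.compile(r'"[^"]*"' r"|'[^']*'" r"|(\w+)")

def unquoted_word_spans_skip_quoted(text):
    """Return spans of words outside quotes (assumes balanced, unescaped quotes)."""
    return [
        m.span(1)
        for m in _TOKEN.finditer(text)
        if m.group(1) is not None  # group 1 is unset when a quoted string matched
    ]

print(unquoted_word_spans_skip_quoted('foo "a b c" baz'))  # [(0, 3), (12, 15)]
print(unquoted_word_spans_skip_quoted('foo "bar" baz'))    # [(0, 3), (10, 13)]
```

Because finditer() resumes after each match, the quoted strings are consumed whole and their inner words are never seen by the \w+ branch.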