python, regex, tokenize

Find indexes of unquoted words in a string using `re.finditer()` method


I'm trying to find the position (index) of unquoted words in a string, but all my tests have been unsuccessful.

For the string string = 'foo "bar" baz', I'd like to get:

(0, 3)   # This for foo
(10, 13) # This for baz
# I'd like to skip the quoted "bar"

However, every regular expression I try includes the quoted 'bar' or parts of it:

import re

string = 'foo "bar" baz'
_RE_UNQUOTED_VALUES = re.compile(r"([^\"']\w+[^\"'])")
print([m.span() for m in _RE_UNQUOTED_VALUES.finditer(string)])

outputs: [(0, 4), (5, 8), (9, 13)]

Or using:

_RE_UNQUOTED_VALUES = re.compile(r"(?!(\"|'))\w+(?!(\"|'))")
# Outputs [(0, 3), (5, 7), (10, 13)]

Is this not doable with regular expressions? Am I misunderstanding how finditer() works?


Solution

  • You can combine word boundaries with negative lookarounds:

    import re

    string = "foo 'bar' baz"
    # Whole words with no quote char directly before or after them
    ms = re.finditer(r"""\b(?<!['"])\w+\b(?!['"])""", string)
    print([(x.start(), x.end()) for x in ms])
    # => [(0, 3), (10, 13)]
    


    The \b(?<!['"])\w+\b(?!['"]) regex works as follows: \b matches a word boundary; the (?<!['"]) negative lookbehind fails the match if there is a ' or " char immediately to the left; \w+ matches one or more word chars; the second \b requires a word boundary again, so \w+ cannot backtrack and stop in the middle of a word; and the (?!['"]) negative lookahead fails the match if there is a ' or " char immediately to the right.

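One caveat may be worth illustrating: the lookarounds only inspect the characters directly next to each word, so a word in the middle of a multi-word quoted string (the b in "a b c") would still be reported. The sketch below wraps the pattern from the answer in a helper and compares it with an alternative that is not part of the original answer: an alternation that consumes whole quoted strings first and captures bare words in group 1. The names spans_by_lookaround and spans_skipping_quoted are made up for illustration, and the alternative assumes quotes are balanced and never escaped.

```python
import re

# Pattern from the answer: a whole word with no quote char on either side.
ADJACENT_QUOTE_RE = re.compile(r"""\b(?<!['"])\w+\b(?!['"])""")

# Alternative (an assumption, not from the answer): try to match a complete
# quoted string first so it is consumed; otherwise capture a bare word.
QUOTED_OR_WORD_RE = re.compile(r"""'[^']*'|"[^"]*"|(\w+)""")


def spans_by_lookaround(text):
    """Spans of words that have no quote char directly next to them."""
    return [m.span() for m in ADJACENT_QUOTE_RE.finditer(text)]


def spans_skipping_quoted(text):
    """Spans of words outside quoted strings (group 1 is set only for bare words)."""
    return [
        m.span(1)
        for m in QUOTED_OR_WORD_RE.finditer(text)
        if m.group(1) is not None
    ]


# Single quoted word: both approaches agree.
print(spans_by_lookaround('foo "bar" baz'))      # [(0, 3), (10, 13)]
print(spans_skipping_quoted('foo "bar" baz'))    # [(0, 3), (10, 13)]

# Multi-word quoted string: the lookaround version still reports the inner "b".
print(spans_by_lookaround('foo "a b c" baz'))    # [(0, 3), (7, 8), (12, 15)]
print(spans_skipping_quoted('foo "a b c" baz'))  # [(0, 3), (12, 15)]
```

If quoted values are always single words, as in the question, the lookaround pattern is the shorter choice; the alternation is only a suggestion for inputs where quoted strings can contain spaces.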