python regex

Regex - negative lookbehind for any character excluding pure whitespace


I'm trying to write a regex pattern that matches a -- comment only when nothing but whitespace precedes it on the same line. If any non-whitespace character comes before the --, the match should fail. For example:

--hello (match)
--goodbye (match)
ROW_NUMBER() OVER (ORDER BY DATE) --date (fail)
  --comment with some indentation (match)
    --another comment with some indentation (match)

The closest I've got is the pattern (?<!.)--.*\n, which gives me this result:

--hello (match)
--goodbye (match)
ROW_NUMBER() OVER (ORDER BY DATE) --date (fail)
  --comment with some indentation (fail)
    --another comment with some indentation (fail)

I've also tried (?<!\s)--.*\n and (?<=\S)--.*\n, but both return no matches at all.
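
For anyone who wants to reproduce the result of the (?<!.) attempt outside regexr, here is a minimal sketch. It assumes the sample lines are joined with plain \n line breaks and uses the standard library re module. The variable names are only illustrative.

```python
import re

# Sample lines from the question, joined with "\n" (an assumed layout).
text = (
    "--hello\n"
    "--goodbye\n"
    "ROW_NUMBER() OVER (ORDER BY DATE) --date\n"
    "  --comment with some indentation\n"
    "    --another comment with some indentation\n"
)

# (?<!.) inspects only the single character right before "--".
# A space or tab counts as "any character", so indented comments are
# rejected; only "--" at the very start of a line gets through.
first_attempt = re.findall(r'(?<!.)--.*\n', text)
print(first_attempt)
```

Only the two unindented comments come back, because a one-character lookbehind cannot tell indentation apart from code that precedes the comment.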

EDIT: a regexr.com demo illustrating the issue more clearly: regexr.com/6j0mt


Solution

  • With the PyPI regex module, which supports variable-width lookbehinds, you can use

    import regex
    
    text = r"""--hello
    --goodbye
    ROW_NUMBER() OVER (ORDER BY DATE) --date
      --comment with some indentation
        --another comment with some indentation"""
    
    print( regex.findall(r'(?<=^[^\S\r\n]*)--.*', text, regex.M) )
    # => ['--hello', '--goodbye', '--comment with some indentation', '--another comment with some indentation']
    

    Or, with the standard library re module, match the indentation and capture the comment in a group instead:

    import re
     
    text = r"""--hello
    --goodbye
    ROW_NUMBER() OVER (ORDER BY DATE) --date
      --comment with some indentation
        --another comment with some indentation"""
     
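    # With a single capturing group, re.findall returns only the Group 1 values
    print( re.findall(r'^[^\S\r\n]*(--.*)', text, re.M) )
    # => ['--hello', '--goodbye', '--comment with some indentation', '--another comment with some indentation']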
    

    Pattern details

    • (?<=^[^\S\r\n]*) - a positive lookbehind that matches a position immediately preceded by the start of a string/line and zero or more horizontal whitespace chars (the lookbehind is variable-width, so it works with PyPI regex but not with re)
    • ^ - start of a string (here, of a line, because the re.M / regex.M option is used)
    • [^\S\r\n]* - zero or more whitespace chars other than carriage return and line feed chars (the negated class excludes non-whitespace, CR and LF), so the indentation match stays on the current line
    • (--.*) - Group 1: -- and the rest of the line (.* matches zero or more chars other than line break chars, as many as possible).
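
The pattern notes above can be checked with a short sketch using only the standard library. It assumes a made-up three-line sample (one comment is tab-indented) and illustrative variable names. It first confirms that re refuses to compile the variable-width lookbehind, then uses the capture-group pattern with re.finditer to recover both the comment text and the offset where -- begins.

```python
import re

# Hypothetical sample: a plain comment, a tab-indented comment, a trailing comment.
text = "--hello\n\t--tabbed comment\nSELECT 1 --trailing"

# The stdlib re module only accepts fixed-width lookbehinds, so compiling
# the lookbehind variant raises re.error.
try:
    re.compile(r'(?<=^[^\S\r\n]*)--.*', re.M)
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

# Capture-group alternative: the indentation is consumed by the match,
# while Group 1 (and its start offset) describes the comment itself.
comments = [
    (m.start(1), m.group(1))
    for m in re.finditer(r'^[^\S\r\n]*(--.*)', text, re.M)
]

print(lookbehind_supported)
print(comments)
```

Using m.start(1) rather than m.start() matters when the comment's position is needed, since the overall match also covers the leading indentation.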