ruby regex negative-lookbehind

Why is this negative lookbehind wrong?


def get_hashtags(post)
    tags = []
    post.scan(/(?<![0-9a-zA-Z])(#+)([a-zA-Z]+)/){|x,y| tags << y}
    tags
end

Test.assert_equals(get_hashtags("two hashs##in middle of word#"), [])
#Expected: [], instead got: ["in"]

Shouldn't the lookbehind check that the match is not preceded by a letter or digit? Why is 'in' still accepted as a valid match?


Solution

  • You should use \K rather than a negative lookbehind. That allows you to simplify your regex considerably: no need for a pre-defined array, capture groups or a block.

    \K means "discard everything matched so far". The key here is that variable-length matches can precede \K, whereas (in Ruby and most other languages) variable-length matches are not permitted in (negative or positive) lookbehinds.

    r = /
        [^0-9a-zA-Z#] # match one character that is not a letter, digit or pound sign
        \#+           # match one or more pound signs
        \K            # discard everything matched so far
        [a-zA-Z]+     # match one or more letters
        /x            # extended mode
    

    Note that the # in \#+ would not need to be escaped if the regex were not written in extended mode, where an unescaped # starts a comment.

    "two hashs##in middle of word#".scan r
      #=> []
    
    "two hashs&#in middle of word#".scan r
      #=> ["in"]
    
    "two hashs#in middle of word&#abc of another word.###def ".scan r
      #=> ["abc", "def"]
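
As for why the original lookbehind accepted 'in': a negative lookbehind is checked only at the position where a match attempt starts. In "hashs##in" the attempt at the first # fails (it is preceded by s), but the engine then retries at the second #, which is preceded by # rather than a letter or digit, so the lookbehind passes there. The sketch below shows this, together with a variant that adds # to the lookbehind's character class. That variant is an alternative to the \K approach and is not part of the answer above; the names post, original and fixed are only illustrative.

```ruby
post = "two hashs##in middle of word#"

# Original pattern: a match may start at the second '#', because the
# character before it ('#') is not a letter or digit.
original = /(?<![0-9a-zA-Z])(#+)([a-zA-Z]+)/
p post.scan(original)                    #=> [["#", "in"]]

# Variant (an assumption, not from the answer): also forbid '#' before the
# match, so a run of '#' can only be entered at its first character.
fixed = /(?<![0-9a-zA-Z#])#+([a-zA-Z]+)/
p post.scan(fixed).flatten               #=> []
p "#ruby and a&#b".scan(fixed).flatten   #=> ["ruby", "b"]
```

One difference worth noting: the \K pattern needs some character before the #, so it skips a tag at the very start of the string, whereas the lookbehind variant picks it up, as the last line shows.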