Why am I seeing different results for these two nearly identical Ruby regex patterns, and why is one matching what I think it shouldn't?

Using Ruby 1.9.2, I have the following Ruby code in IRB:

> r1 = /^(?=.*[\d])(?=.*[\W]).{8,20}$/i
> r2 = /^(?=.*\d)(?=.*\W).{8,20}$/i
> a = ["password", "1password", "password1", "pass1word", "password 1"]
> a.each {|p| puts "r1: #{r1.match(p) ? "+" : "-"} \"#{p}\"".ljust(25) + "r2: #{r2.match(p) ? "+" : "-"} \"#{p}\""}

This results in the following output:

r1: - "password"         r2: - "password"
r1: + "1password"        r2: - "1password"
r1: + "password1"        r2: - "password1"
r1: + "pass1word"        r2: - "pass1word"
r1: + "password 1"       r2: + "password 1"

1.) Why do the results differ?

2.) Why would r1 match on strings 2, 3 and 4? Wouldn't the (?=.*[\W]) lookahead cause it to fail since there aren't any non-word characters in those examples?

Solution

This results from the interaction between a couple of regex features and Unicode. In Ruby 1.9, \w covers only the ASCII word characters, so \W (all non-word characters) includes U+212A "KELVIN SIGN" K and U+017F "LATIN SMALL LETTER LONG S" ſ. The /i flag then adds the case-insensitive equivalents of both of these, which are the “normal” k and s characters (U+006B "LATIN SMALL LETTER K" and U+0073 "LATIN SMALL LETTER S").

So it’s the s in password that’s being interpreted as a non-word character in certain cases.
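To see the two code points involved, here is a small sketch. It assumes Ruby 2.4 or newer (where String#downcase and String#upcase apply Unicode case mappings, unlike 1.9.2), and the constant and method names are only illustrative. The last line prints whether your interpreter shows the [\W] behaviour at all, since that depends on the Ruby version.

```ruby
# Assumption: Ruby 2.4+, so upcase/downcase handle non-ASCII characters.
KELVIN = "\u212A" # KELVIN SIGN
LONG_S = "\u017F" # LATIN SMALL LETTER LONG S

# Without /i, \W matches anything outside [a-zA-Z0-9_],
# which includes both of these non-ASCII characters.
def non_word?(char)
  !(char =~ /\W/).nil?
end

puts "U+212A matches \\W: #{non_word?(KELVIN)}, downcases to #{KELVIN.downcase.inspect}"
puts "U+017F matches \\W: #{non_word?(LONG_S)}, upcases to #{LONG_S.upcase.inspect}"

# The problematic combination: \W inside a character class plus /i.
# The result is version dependent, so it is only printed here.
bracketed_match = !("s" =~ /[\W]/i).nil?
puts "/[\\W]/i matches 's': #{bracketed_match}"
```

If the last line prints true, the s in "password" is being treated as a case-insensitive match for ſ, as described above.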

Note that this only seems to occur when the \W is inside a character class (i.e. [\W]), which is why r2 behaves as expected. Also, I can only reproduce this in irb; inside a standalone script it seems to work as expected.
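Since nothing in the pattern depends on letter case (\d, \W and . behave the same either way), one way to sidestep the problem might be to drop the /i flag and leave \W outside a character class. A minimal sketch, assuming the intent is "8 to 20 characters with at least one digit and one non-word character"; PASSWORD_FORMAT and valid_password? are hypothetical names, not something from the question:

```ruby
# Same lookaheads as r2, but without /i, so no case folding is applied to \W.
PASSWORD_FORMAT = /^(?=.*\d)(?=.*\W).{8,20}$/

# Hypothetical helper: true when the candidate satisfies the pattern.
def valid_password?(candidate)
  !(PASSWORD_FORMAT =~ candidate).nil?
end

["password", "1password", "password1", "pass1word", "password 1"].each do |candidate|
  puts "#{valid_password?(candidate) ? '+' : '-'} #{candidate.inspect}"
end
```

With this pattern only "password 1" should be accepted, because it is the only candidate that contains both a digit and a non-word character (the space).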

See the Ruby bug about this for more information.