python regex python-2.7 case-sensitive

Python seems to incorrectly identify case-sensitive string using regex


I'm checking for a case-sensitive string pattern using Python 2.7 and it seems to return an incorrect match. I've run the following tests:

>>> import re
>>> rex_str = "^((BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?.(?i)pdf$)"
>>> not re.match(rex_str, 'BOA_1988-148.pdf')
False
>>> not re.match(rex_str, 'BOA_1988-148.PDF')
False
>>> not re.match(rex_str, 'BOA1988-148.pdf')
True
>>> not re.match(rex_str, 'boa_1988-148.pdf')
False

The first three results are correct, but the final test, 'boa_1988-148.pdf', should return True, because the pattern is supposed to treat the first three characters (BOA) as case-sensitive.

I checked the expression with an online tester (https://regex101.com/) and the pattern behaved as intended there, flagging the final test as a non-match because 'boa' was lower case. Am I missing something, or do you have to explicitly declare a group as case-sensitive using a mode like (?c)?


Solution

  • Flags do not apply to portions of a regex. You told the regex engine to match case insensitively:

    (?i)
    

    From the syntax documentation:

    (?aiLmsux)  
    

    (One or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function. Flags should be used first in the expression string.

    Note the phrase "for the entire regular expression": the flag applies to the whole pattern, not just to the part that follows it. If you need to match just pdf or PDF, use that in your pattern directly:

    r"^((BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.(?:pdf|PDF)$)"
    

    This matches either .pdf or .PDF. If you need to match any mix of uppercase and lowercase, use:

    r"^((BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.[pP][dD][fF]$)"
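To check the two variants against the test strings from the question, something like the following sketch may help. The names EXT_EITHER, EXT_ANY_CASE and SCOPED are made up for illustration, and the dot is escaped here on the assumption that a literal "." is wanted before the extension. The last pattern uses a scoped inline flag, (?i:...), which as far as I know is only available from Python 3.6 onward, so it is not an option on the 2.7 interpreter in the question and is skipped there.

```python
import re
import sys

# Hypothetical names for illustration. The prefix stays case-sensitive because
# no global (?i) flag is used; the extension spells out the accepted cases.
EXT_EITHER = re.compile(r"^((BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.(?:pdf|PDF)$)")
EXT_ANY_CASE = re.compile(r"^((BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.[pP][dD][fF]$)")

# Scoped inline flag: case-insensitivity limited to the "pdf" group.
# Assumption: only compiled on Python 3.6+, where this syntax exists.
if sys.version_info >= (3, 6):
    SCOPED = re.compile(r"^((BOA_[0-9]{4}-[0-9]{1,3})(?:CO)?\.(?i:pdf)$)")
else:
    SCOPED = None

names = [
    "BOA_1988-148.pdf",  # both patterns match
    "BOA_1988-148.PDF",  # both patterns match
    "BOA_1988-148.Pdf",  # mixed case: only EXT_ANY_CASE matches
    "BOA1988-148.pdf",   # missing underscore: neither matches
    "boa_1988-148.pdf",  # lower-case prefix: neither matches
]

for name in names:
    either = bool(EXT_EITHER.match(name))
    any_case = bool(EXT_ANY_CASE.match(name))
    print("%-18s either=%-5s any_case=%s" % (name, either, any_case))
```

With any of these patterns, 'boa_1988-148.pdf' is rejected, which is the behaviour the question was after.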