Tags: python, python-3.x, python-re

Understanding regex flags and Bitwise operators


I'm trying to understand how regex flags and bitwise operators tie together. The only related thing I can find is in the documentation, which says you can combine flags with bitwise OR (the '|' operator). I've always used that operator with flags in the past, but I want to know how it works and what the advantage is of using the other operators (&, ^, ~, >>, <<).

From my understanding, each flag represents a value?

import re

print('{:>15} = {}'.format('re.ASCII',      int(re.ASCII)))
print('{:>15} = {}'.format('re.DEBUG',      int(re.DEBUG)))
print('{:>15} = {}'.format('re.IGNORECASE', int(re.IGNORECASE)))
print('{:>15} = {}'.format('re.LOCALE',     int(re.LOCALE)))
print('{:>15} = {}'.format('re.MULTILINE',  int(re.MULTILINE)))
print('{:>15} = {}'.format('re.DOTALL',     int(re.DOTALL)))
print('{:>15} = {}'.format('re.VERBOSE',    int(re.VERBOSE)))

>       re.ASCII = 256
>       re.DEBUG = 128
>  re.IGNORECASE = 2
>      re.LOCALE = 4
>   re.MULTILINE = 8
>      re.DOTALL = 16
>     re.VERBOSE = 64

What would be the difference in these examples:

re.compile(r'[\w]+', flags=(re.IGNORECASE | re.MULTILINE))
re.compile(r'[\w]+', flags=(re.IGNORECASE & re.MULTILINE))
re.compile(r'[\w]+', flags=(re.IGNORECASE ^ re.MULTILINE))
or
re.compile(r'[\w]+', flags=(re.DOTALL | re.MULTILINE))
re.compile(r'[\w]+', flags=(re.DOTALL & re.MULTILINE))
re.compile(r'[\w]+', flags=(re.DOTALL ^ re.MULTILINE))

Bitwise table for reference:

Operator  Example  Meaning
&         a & b    Bitwise AND
|         a | b    Bitwise OR
^         a ^ b    Bitwise XOR (exclusive OR)
~         ~a       Bitwise NOT
<<        a << n   Bitwise left shift
>>        a >> n   Bitwise right shift

Solution

  • Using flags this way is simply common practice. In Python, C++, and probably other languages too, options are often encoded as bit flags in this style. Here is an example.

    re.compile(r'[\w]+', flags=(re.IGNORECASE | re.MULTILINE))
    

    Setting the flags like this means you want both settings applied, namely IGNORECASE and MULTILINE.

    You may be wondering why this applies both settings. It is because the regex engine most likely handles the flags along these lines:

    if flags & re.IGNORECASE:
        handle_in_ignorecase_way()
    
    if flags & re.MULTILINE:
        handle_in_multiline_way()
    

    This is why flags are usually given values like 1, 2, 4, 8, and so on: each flag occupies its own bit position, so the regex engine can combine them with | and test them with & when the user passes several flags at once. Reading them back takes only simple bit operations.
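
To make this concrete for the operators in the question, here is a minimal sketch, assuming Python 3 and the flag values printed in the question. The variable names (both, neither, toggled, and so on) and the 'abc' test pattern are only illustrative and not part of the re module. The short version seems to be: | is the operator you normally want, & between two different flags leaves no flags at all, and ^ between two different flags happens to give the same result as |.

```python
import re

# Each flag's integer value has exactly one bit set, so flags never overlap.
for flag in (re.IGNORECASE, re.MULTILINE, re.DOTALL):
    print('{:>15} = {:09b}'.format(flag.name, int(flag)))

# A left shift builds the same values: bit 3 is MULTILINE (1 << 3 == 8).
shifted = 1 << 3

# OR sets both bits: 2 | 8 == 10.
both = re.IGNORECASE | re.MULTILINE

# AND keeps only bits set in BOTH operands; two different flags share none,
# so the result is 0, i.e. no flags at all.
neither = re.IGNORECASE & re.MULTILINE

# XOR keeps bits set in exactly one operand; for two different flags
# that is the same result as OR.
toggled = re.IGNORECASE ^ re.MULTILINE

# AND combined with NOT clears one flag from a combination.
without_multiline = both & ~re.MULTILINE

# Effect on matching: only the OR-ed pattern is case-insensitive.
or_pattern = re.compile(r'abc', flags=re.IGNORECASE | re.MULTILINE)
and_pattern = re.compile(r'abc', flags=re.IGNORECASE & re.MULTILINE)
print(or_pattern.search('ABC') is not None)   # True
print(and_pattern.search('ABC') is not None)  # False

# Testing for a flag mirrors the `if flags & re.IGNORECASE` check above.
print(bool(or_pattern.flags & re.IGNORECASE))  # True
print(bool(or_pattern.flags & re.DOTALL))      # False
```

So in practice & is mostly useful for testing whether a flag is set, and & together with ~ for removing one. The shift operators rarely come up with re flags, since the named constants already exist.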