I'm trying to understand how the regex flags and the bitwise operators tie together. The only thing I can find on this is the documentation, which says flags can be combined with bitwise OR (the '|' operator). I've always used that operator to combine flags, but I want to know how it works and whether there is any advantage to using the other operators (&, ^, ~, >>, <<).
From my understanding, each flag represents an integer value?
import re
print('{:>15} = {}'.format('re.ASCII', int(re.ASCII)))
print('{:>15} = {}'.format('re.DEBUG', int(re.DEBUG)))
print('{:>15} = {}'.format('re.IGNORECASE', int(re.IGNORECASE)))
print('{:>15} = {}'.format('re.LOCALE', int(re.LOCALE)))
print('{:>15} = {}'.format('re.MULTILINE', int(re.MULTILINE)))
print('{:>15} = {}'.format('re.DOTALL', int(re.DOTALL)))
print('{:>15} = {}'.format('re.VERBOSE', int(re.VERBOSE)))
> re.ASCII = 256
> re.DEBUG = 128
> re.IGNORECASE = 2
> re.LOCALE = 4
> re.MULTILINE = 8
> re.DOTALL = 16
> re.VERBOSE = 64
What would be the difference in these examples:
re.compile(r'[\w]+', flags=(re.IGNORECASE | re.MULTILINE))
re.compile(r'[\w]+', flags=(re.IGNORECASE & re.MULTILINE))
re.compile(r'[\w]+', flags=(re.IGNORECASE ^ re.MULTILINE))
or
re.compile(r'[\w]+', flags=(re.DOTALL | re.MULTILINE))
re.compile(r'[\w]+', flags=(re.DOTALL & re.MULTILINE))
re.compile(r'[\w]+', flags=(re.DOTALL ^ re.MULTILINE))
Bitwise table for reference:
Operator | Example | Meaning |
---|---|---|
& | a & b | Bitwise AND |
\| | a \| b | Bitwise OR |
^ | a ^ b | Bitwise XOR (exclusive OR) |
~ | ~a | Bitwise NOT |
<< | a << n | Bitwise left shift |
>> | a >> n | Bitwise right shift |
Combining flags this way is just common practice. In Python, C++, and probably other languages too, flags are usually used in this style. Let me give you an example.
re.compile(r'[\w]+', flags=(re.IGNORECASE | re.MULTILINE))
When you set the flags like this, you are asking for both settings, IGNORECASE and MULTILINE, to be applied.
You may be wondering why this applies both settings. It is because the regex engine most likely checks the flags along these lines:
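To make that concrete, here is a small sketch comparing the combined flags with each flag on its own. The sample text and the variable names are made up for illustration; only the `re` calls are the real API.

```python
import re

# Hypothetical sample text: three lines that differ only in case.
text = "Hello\nHELLO\nhello"

# Both flags: '^' and '$' match at every line, and case is ignored.
both = re.compile(r'^hello$', flags=re.IGNORECASE | re.MULTILINE)
# One flag at a time, for comparison.
only_ignorecase = re.compile(r'^hello$', flags=re.IGNORECASE)
only_multiline = re.compile(r'^hello$', flags=re.MULTILINE)

both_matches = both.findall(text)
ignorecase_matches = only_ignorecase.findall(text)
multiline_matches = only_multiline.findall(text)

print(both_matches)        # ['Hello', 'HELLO', 'hello']
print(ignorecase_matches)  # [] because '^'/'$' only anchor to the whole string
print(multiline_matches)   # ['hello'] because matching is still case-sensitive
```

Only the `|` version gets all three lines, because it is the only one with both bits switched on.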
if flags & re.IGNORECASE:
    handle_in_ignorecase_way()
if flags & re.MULTILINE:
    handle_in_multiline_way()
This is why flag values are usually powers of two (1, 2, 4, 8, and so on): each flag occupies its own bit position, so users can combine several flags with | and the regex engine can test for each one with a simple &.
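As for the other operators in the question, here is a rough sketch of what they produce for these flags. `NAMED_FLAGS` and `active_flags` are helper names I made up for the illustration; they are not part of the `re` module.

```python
import re

# Hypothetical lookup table used only for this illustration.
NAMED_FLAGS = {
    'IGNORECASE': re.IGNORECASE,  # 2  -> 0b00010
    'MULTILINE': re.MULTILINE,    # 8  -> 0b01000
    'DOTALL': re.DOTALL,          # 16 -> 0b10000
}

def active_flags(flags):
    """List the names in NAMED_FLAGS whose bit is set in `flags`."""
    return [name for name, flag in NAMED_FLAGS.items() if flags & flag]

or_flags = re.IGNORECASE | re.MULTILINE   # 2 | 8 = 10: both bits set
and_flags = re.IGNORECASE & re.MULTILINE  # 2 & 8 = 0: the flags share no bit
xor_flags = re.IGNORECASE ^ re.MULTILINE  # 2 ^ 8 = 10: same as OR for distinct bits

# Removing a flag from an existing combination:
cleared = or_flags & ~re.MULTILINE        # AND with NOT clears the MULTILINE bit
toggled = or_flags ^ re.MULTILINE         # XOR flips the MULTILINE bit (here: off)

print(int(or_flags), active_flags(or_flags))    # 10 ['IGNORECASE', 'MULTILINE']
print(int(and_flags), active_flags(and_flags))  # 0 []
print(int(xor_flags), active_flags(xor_flags))  # 10 ['IGNORECASE', 'MULTILINE']
print(int(cleared), active_flags(cleared))      # 2 ['IGNORECASE']
print(int(toggled), active_flags(toggled))      # 2 ['IGNORECASE']

# With '&' no flag is set at all, so the pattern stays case-sensitive.
no_flag_match = re.compile(r'hello', flags=and_flags).search('HELLO')
print(no_flag_match)                            # None
```

So in your examples, `&` between two different flags gives 0 and turns everything off, while `^` happens to give the same result as `|` only because the two flags do not overlap. The same holds for the DOTALL/MULTILINE pair: `|` and `^` both give 24, and `&` gives 0. As far as I can tell, `&` and `~` are mainly useful for testing or removing a flag from a combination, and the shift operators have no sensible use with these flags, since shifting just turns one flag's value into a different number.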