
Simple case folding vs full case folding in Python regex module


The module I'm asking about is Matthew Barnett's regex: https://pypi.org/project/regex/.

On the project description page, the differences in behaviour between V0 and V1 are stated as follows (note the two bullets about case-folding):

Old vs new behaviour

In order to be compatible with the re module, this module has 2 behaviours:

  • Version 0 behaviour (old behaviour, compatible with the re module):

    Please note that the re module’s behaviour may change over time, and I’ll endeavour to match that behaviour in version 0.

    • Indicated by the VERSION0 or V0 flag, or (?V0) in the pattern.
    • Case-insensitive matches in Unicode use simple case-folding by default.
  • Version 1 behaviour (new behaviour, possibly different from the re module):

    • Indicated by the VERSION1 or V1 flag, or (?V1) in the pattern.
    • Case-insensitive matches in Unicode use full case-folding by default.

If no version is specified, the regex module will default to regex.DEFAULT_VERSION.

I tried a few examples myself but couldn't find any difference between the two:

Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> r = regex.compile("(?V0i)и")
>>> r
regex.Regex('(?V0i)и', flags=regex.I | regex.V0)
>>> r.search("И")
<regex.Match object; span=(0, 1), match='И'>
>>> regex.search("(?V0i)é", "É")
<regex.Match object; span=(0, 1), match='É'>
>>> regex.search("(?V0i)é", "E")
>>> regex.search("(?V1i)é", "E")

What is the difference between simple case-folding and full case-folding? Or can you provide an example where a (case insensitive) regex matches something in V1 but not in V0?


Solution

  • The two modes follow the Unicode case folding table (CaseFolding.txt). Excerpt:

    # The entries in this file are in the following machine-readable format:
    #
    # <code>; <status>; <mapping>; # <name>
    #
    # The status field is:
    # C: common case folding, common mappings shared by both simple and full mappings.
    # F: full case folding, mappings that cause strings to grow in length. Multiple characters are separated by spaces.
    # S: simple case folding, mappings to single characters where different from F.
    
    [...]
    
    # Usage:
    #  A. To do a simple case folding, use the mappings with status C + S.
    #  B. To do a full case folding, use the mappings with status C + F.
    

    The two foldings differ only for a few special characters. Examples are the small and capital Latin sharp s:

    00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
    
    [...]
    
    1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
    1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
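
To make the C + S versus C + F rule concrete, here is a minimal sketch in plain Python that parses a few table lines and folds strings both ways. It does not use the regex module and is not how regex implements folding internally. The names `EXCERPT`, `build_table`, `SIMPLE`, `FULL` and `fold` are made up for this illustration, and the `0053; C` line is an extra entry from the same Unicode table, added only so that "SS" can be folded.

```python
# Excerpt in CaseFolding.txt format. The three sharp-s lines are the ones quoted
# in the answer; the "0053; C" line is an extra entry from the same Unicode
# table, included here (as an assumption of this demo) so that "SS" folds too.
EXCERPT = """\
0053; C; 0073; # LATIN CAPITAL LETTER S
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def build_table(text, statuses):
    """Map each source character to its folded string for the given statuses."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing "# <name>" comment
        if not data:
            continue
        code, status, mapping = [field.strip() for field in data.split(";")[:3]]
        if status in statuses:
            table[chr(int(code, 16))] = "".join(
                chr(int(cp, 16)) for cp in mapping.split()
            )
    return table


SIMPLE = build_table(EXCERPT, {"C", "S"})  # usage A: C + S
FULL = build_table(EXCERPT, {"C", "F"})    # usage B: C + F


def fold(s, table):
    """Fold a string; characters missing from the excerpt are left unchanged."""
    return "".join(table.get(ch, ch) for ch in s)


if __name__ == "__main__":
    for name, table in (("simple", SIMPLE), ("full", FULL)):
        same = fold("SS", table) == fold("\u00df", table)
        print(f"{name}: 'SS' and 'ß' fold to the same string: {same}")
```

Under simple folding "ß" stays a single character, so it never compares equal to "ss" or "SS". Under full folding both become "ss". That is the kind of input where V1 and V0 should differ: the regex project page itself shows `regex.match(r"(?iV1)strasse", "stra\N{LATIN SMALL LETTER SHARP S}e")` matching, and the same pattern with `(?iV0)` would be expected not to match. Python's built-in `str.casefold()` also performs full case folding, so `"ß".casefold()` gives `"ss"`.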