python unicode utf-8 character-encoding

Unable to get the uppercase of 'ß' (German character called eszett)


Hello, I have to convert a string column into its uppercase version, but when 'ß' is present in the string, it gets changed to 'SS' during the conversion. I understand that this is because 'SS' was previously considered the uppercase of 'ß'. But since 2017, both 'SS' and the uppercase form 'ẞ' are allowed, and a Unicode code point for it is also available.

I have the following questions on this:

  1. Why is Python not converting it to the uppercase 'ẞ'?

  2. Is it because of the Unicode standard that is embedded in Python? How can I find out which Unicode version Python/Jupyter Notebook is using?

  3. Is there any way to get the uppercase 'ẞ' instead of 'SS' in Python?



Solution

  • Various Python versions use specific Unicode versions. For example, I think the original Python 3.7 used Unicode 10.0.0 which, while it has the letter available (it has had it since Unicode 5.1, I believe), still lists the old upper/lower mapping:

    00DF ß LATIN SMALL LETTER SHARP S
        = Eszett
        - German
        - uppercase is "SS"
        - nonstandard uppercase is 1E9E ẞ
    1E9E ẞ LATIN CAPITAL LETTER SHARP S
        - lowercase is 00DF ß
    

    Even the latest standard at the time of this answer, 13.0.0 (though this change was made in 11.0.0), appears to allow discretion as to how to convert lower to upper:

    00DF ß LATIN SMALL LETTER SHARP S
        = Eszett
        - German
        - not used in Swiss High German
        - uppercase is "SS" or 1E9E ẞ
    1E9E ẞ LATIN CAPITAL LETTER SHARP S
        - not used in Swiss High German
        - lowercase is 00DF ß
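
The asymmetry in these listings can be observed directly in the interpreter. The following is only a small sketch; the variable names are illustrative, and the results shown are what I'd expect on any Python 3 version, since the "SS" mapping has been in place for a long time:

```python
import unicodedata

small = "\u00df"    # ß, LATIN SMALL LETTER SHARP S
capital = "\u1e9e"  # ẞ, LATIN CAPITAL LETTER SHARP S

print(unicodedata.name(small))
print(unicodedata.name(capital))

# Uppercasing the small letter expands to two characters ...
upper_of_small = small.upper()
print(upper_of_small)        # SS

# ... while lowercasing the capital letter gives the small one back.
lower_of_capital = capital.lower()
print(lower_of_capital)      # ß

# The capital letter is already uppercase, so upper() leaves it alone.
print(capital.upper() == capital)  # True
```

So the capital letter maps down to 'ß', but 'ß' does not map back up to it.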
    

    The following table maps some Python versions to the Unicode versions they use:

     Python     Unicode
    --------    -------
       3.5.9      8.0.0
      3.6.11      9.0.0
       3.7.8     11.0.0
    3.8.4rc1     12.1.0
     3.9.0b4     13.0.0
    3.10.0a0     13.0.0
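
To check which Unicode version your own interpreter (or Jupyter kernel) uses, rather than relying on a table like this one, you can ask the standard `unicodedata` module. This is a minimal sketch; the variable names are mine:

```python
import sys
import unicodedata

# Version of the running interpreter, e.g. "3.8.2".
python_version = sys.version.split()[0]

# Version of the Unicode database compiled into this interpreter.
unicode_version = unicodedata.unidata_version

print(f"Python {python_version} uses Unicode {unicode_version}")
```

Running this in a notebook cell reports the values for the kernel's Python, which may differ from the Python you get on the command line.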
    

    So you may well have to wait for a later version of Unicode (and a Python that uses that Unicode version) where the mapping is a little less wishy-washy than 'uppercase is "SS" or 1E9E ẞ'. But this may actually be precluded by the Unicode stability policy, which states, in part:

    If two characters form a case pair in a version of Unicode, they will remain a case pair in each subsequent version of Unicode. If two characters do not form a case pair in a version of Unicode, they will never become a case pair in any subsequent version of Unicode.

    You can make a case pair from a newly introduced character, assuming that the one you want to pair it with is not already paired, but that's not allowed here since:

    • this "new" character was introduced way back in Unicode 5.1; and
    • the character we'd want to pair it with is already paired.

    My reading of that leads me to believe that the only way to fix this without violating that policy would be to introduce two new characters in a case pair, something like:

    ß LATIN SMALL LETTER SHARP S THAT IS LOWER OF ẞ
    ẞ LATIN CAPITAL LETTER SHARP S THAT IS UPPER OF ß
    

    However, I'm not sure that'll ever get past the Unicode Consortium's silliness filters :-)

    For an immediate fix, you can simply force that specific character to whatever you want it to be before applying the built-in case change, something like:

    to_be_uppered.replace('ß', 'ẞ').upper()
    to_be_lowered.replace('ẞ', 'ß').lower()
    

    The latter appears to be unnecessary, at least on my version, Python 3.8.2. I include it just in case an earlier Python version needs it. It may even be worth putting these into custom my_upper() and my_lower() functions, if it turns out there are more cases like this that you need to handle.
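
A rough sketch of what such helper functions might look like is below. The function names come from the suggestion above; the override tables `_UPPER_OVERRIDES` and `_LOWER_OVERRIDES` are my own assumption about how you might collect further special cases, not anything Python provides:

```python
# Hypothetical override tables: characters to swap *before* the built-in
# case conversion runs. Add more entries if other special cases turn up.
_UPPER_OVERRIDES = {"\u00df": "\u1e9e"}  # ß -> ẞ (instead of "SS")
_LOWER_OVERRIDES = {"\u1e9e": "\u00df"}  # ẞ -> ß


def my_upper(text: str) -> str:
    """Uppercase text, mapping ß to ẞ rather than to "SS"."""
    for source, target in _UPPER_OVERRIDES.items():
        text = text.replace(source, target)
    return text.upper()


def my_lower(text: str) -> str:
    """Lowercase text, forcing ẞ to ß before the built-in conversion."""
    for source, target in _LOWER_OVERRIDES.items():
        text = text.replace(source, target)
    return text.lower()


print("straße".upper())      # STRASSE
print(my_upper("straße"))    # STRAẞE
print(my_lower("STRAẞE"))    # straße
```

For a column of strings, you would apply `my_upper` to each value in place of the plain `upper()` call.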