Tags: string, unicode, string-comparison, raku, unicode-string

Foldcase conversion between (German) lower ß (U+00DF) and upper ẞ (U+1E9E)?


According to Wikipedia, an uppercase ẞ (Unicode U+1E9E) was officially adopted in 2017, at least as an option, for what may in fact be a subset of fully capitalized words in German:

In June of that year, the Council for German Orthography officially adopted a rule that ⟨ẞ⟩ would be an option for capitalizing ⟨ß⟩ besides the previous capitalization as ⟨SS⟩ (i.e., variants STRASSE and STRAẞE would be accepted as equally valid). [2]

It seems like this addition to the German language would greatly simplify case-insensitive comparisons between strings (so-called "case-folding" or "fold-case" comparisons). Note that I started this inquiry trying to understand Raku's (a.k.a. Perl6's) implementation, but the question in fact seems to generalize to other programming languages. Here is Raku's default behavior, starting with 13 words from rfdr_Regeln_2017.pdf that have been lowercased (via Raku's .lc method):

~$ cat TO_ẞ_OR_NOT_TO_ẞ.txt
maß straße grieß spieß groß grüßen außen außer draußen strauß beißen fleiß heißen
~$ raku -ne '.words>>.match(/^ <:Ll>+ $/).say;' TO_ẞ_OR_NOT_TO_ẞ.txt
(「maß」 「straße」 「grieß」 「spieß」 「groß」 「grüßen」 「außen」 「außer」 「draußen」 「strauß」 「beißen」 「fleiß」 「heißen」)
~$ raku -ne '.uc.say;' TO_ẞ_OR_NOT_TO_ẞ.txt
MASS STRASSE GRIESS SPIESS GROSS GRÜSSEN AUSSEN AUSSER DRAUSSEN STRAUSS BEISSEN FLEISS HEISSEN
~$ raku -ne '.fc.say;' TO_ẞ_OR_NOT_TO_ẞ.txt
mass strasse griess spiess gross grüssen aussen ausser draussen strauss beissen fleiss heissen

I'm surprised that Raku's fc fold-case implementation essentially converts ß to lowercase ss. Because uc likewise turns ß into SS, it is no surprise that eq string-equality tests between the upper/lower "round-tripped" words and the lowercased originals all return False:

~$ raku -ne 'for .words {print $_.uc.lc eq $_.lc }; "".put;'  TO_ẞ_OR_NOT_TO_ẞ.txt
FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse

Fold-cased (.fc) words match, but they do so on the basis of ss characters, not ß:

~$ raku -ne 'for .words {print $_.uc.lc eq $_.fc }; "".put;' TO_ẞ_OR_NOT_TO_ẞ.txt
TrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrue

Starting from a capital-ẞ, taking just one capitalized/uppercase word again demonstrates the dichotomy:

~$ echo "straße STRASSE STRAẞE" | raku -ne ' .put for .words;'
straße
STRASSE
STRAẞE
~$ echo "straße STRASSE STRAẞE" | raku -ne ' .lc.say for .words;'
straße
strasse
straße
~$ echo "straße STRASSE STRAẞE" | raku -ne ' for .words { say $_.lc eq "straße" };'
True
False
True
~$ echo "straße STRASSE STRAẞE" | raku -ne ' for .words { say $_.lc eq $_.fc };'
False
True
False
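
The results above can be summarized per character. The following is a minimal sketch, assuming a reasonably recent Rakudo; `case-forms` is a hypothetical helper name introduced here for illustration, not a Raku built-in. It only calls the same .lc, .uc and .fc methods used in the one-liners above:

```raku
# Hypothetical helper: collect the three case mappings of a string.
sub case-forms(Str:D $s) {
    %( lc => $s.lc, uc => $s.uc, fc => $s.fc )
}

# Report the mappings for lowercase ß (U+00DF) and uppercase ẞ (U+1E9E).
for "ß", "ẞ" -> $char {
    my %forms = case-forms($char);
    say "{$char.uniname}: lc={%forms<lc>} uc={%forms<uc>} fc={%forms<fc>}";
}

# Consistent with the transcripts above, this should print:
# LATIN SMALL LETTER SHARP S: lc=ß uc=SS fc=ss
# LATIN CAPITAL LETTER SHARP S: lc=ß uc=ẞ fc=ss
```

In other words, ẞ lowercases to ß, but ß does not uppercase to ẞ, and both letters fold to ss.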

Have any programming languages instituted a foldcase conversion between lowercase ß <--> uppercase ẞ, by default? What programming languages have added lowercase ß <--> uppercase ẞ conversion, as an option (or via a library)? Many Questions/Answers on StackOverflow pre-date the 2017 decision, so I'm looking for up-to-date answers.

[ADDENDUM: I note via this FAQ that the Unicode Consortium's rules appear to be at odds with the 2017 decision of the Council for German Orthography].


Solution

    1. Lowercase/Uppercase:

    In Raku, the default conversion from lowercase German ß is to uppercase SS, but this can be overcome (as shown below).

    The Unicode Consortium has a special FAQ on these letters in the German language. However, if one wants to work around the first uc uppercasing issue using Raku, the "ß" => "ẞ" characters can be appropriately translated prior to calling the bog-standard uc uppercase method/function:

    ~$ cat TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
    Maß Straße Grieß Spieß Groß Grüßen Außen Außer Draußen Strauß Beißen Fleiß Heißen
    ~$ raku -ne '.uc.put;' TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
    MASS STRASSE GRIESS SPIESS GROSS GRÜSSEN AUSSEN AUSSER DRAUSSEN STRAUSS BEISSEN FLEISS HEISSEN
    ~$ raku -ne '.trans("ß" => "ẞ").put;' TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
    Maẞ Straẞe Grieẞ Spieẞ Groẞ Grüẞen Auẞen Auẞer Drauẞen Strauẞ Beiẞen Fleiẞ Heiẞen
    ~$ raku -ne '.trans("ß" => "ẞ").uc.put;' TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
    MAẞ STRAẞE GRIEẞ SPIEẞ GROẞ GRÜẞEN AUẞEN AUẞER DRAUẞEN STRAUẞ BEIẞEN FLEIẞ HEIẞEN
    

    The code above works to uppercase text with ẞ instead of SS, and in true Raku/Perl spirit, there's more than one way to do it (TMTOWTDI):

    ~$ raku -ne '.trans("ß" => "ẞ").uc.put;' file
    ~$ raku -e '.trans("ß" => "ẞ").uc.put for lines();' file
    ~$ raku -e 'put .trans("ß" => "ẞ").uc for lines();' file
    ~$ raku -e 'slurp.trans("ß" => "ẞ").uc.put;' file
    ~$ raku -e 'slurp.trans( "\x[00DF]" => "\x[1E9E]" ).uc.put;' file
    ~$ raku -e 'slurp.trans("LATIN SMALL LETTER SHARP S".uniparse => "LATIN CAPITAL LETTER SHARP S".uniparse).uc.put;' file
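
    For use inside a script rather than a one-liner, the same translate-then-uppercase step can be wrapped in a small routine. This is only a sketch: `uc-capital-sharp-s` is a hypothetical name chosen for illustration, not part of Raku's standard library.

    ```raku
    # Hypothetical helper (not a Raku built-in): uppercase text while keeping
    # the sharp s as a single capital letter, by mapping ß (U+00DF) to
    # ẞ (U+1E9E) before calling the standard .uc method.
    sub uc-capital-sharp-s(Str:D $text --> Str:D) {
        $text.trans("ß" => "ẞ").uc
    }

    say uc-capital-sharp-s("Straße");   # STRAẞE
    say "Straße".uc;                    # STRASSE (the default behavior)
    ```

    Text that contains no ß passes through as a plain .uc call, so the helper can be applied to mixed input.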
    

    2. Foldcase:

    The Unicode Consortium promulgates a rule that foldcase pairs should be stable (according to the Unicode Casefolding Stability Policy).

    As for fc foldcase stability, I had hoped that prior conversion of "ß" => "ẞ" would provide a "30th-uppercase character" that would act as a bicameral foldcase partner of lowercase ß (in a pair). The code below seems promising in that starting with a small sample of mixed-case text, you can "round-trip" from uppercase-to-lowercase, and still have output text matching lowercase:

    ~$ raku -ne 'for .words {print $_.uc.lc eq $_.lc }; "".put;' TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
    FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
    ~$ raku -ne 'for .words {print $_.trans("ß" => "ẞ").uc.lc eq $_.lc }; "".put;' TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
    TrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrue
    

    However, the fc foldcase code below shows that the present course of action is to take an uppercase ẞ and convert it to lowercase ss (not to lowercase ß). Essentially, .fc foldcase converts uppercase ẞ or SS to lowercase ss, regardless:

    ~$ raku -ne '.trans("ß" => "ẞ").fc.put;' TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
    mass strasse griess spiess gross grüssen aussen ausser draussen strauss beissen fleiss heissen
    ~$ raku -ne 'for .words {print $_.trans("ß" => "ẞ").uc.fc eq $_.fc }; "".put;' TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
    TrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrue
    ~$ raku -ne 'for .words {print $_.trans("ß" => "ẞ").uc.lc eq $_.fc }; "".put;' TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
    FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
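
    In practice this means that .fc is the method to reach for when the goal is case-insensitive matching rather than round-tripping. A minimal sketch follows; `same-folded` is a hypothetical helper name, not a built-in:

    ```raku
    # Hypothetical helper: case-insensitive equality based on .fc,
    # which folds ß, ẞ and SS all to "ss".
    sub same-folded(Str:D $a, Str:D $b --> Bool:D) {
        $a.fc eq $b.fc
    }

    # All four spellings should compare equal to "straße" under .fc.
    for <straße STRASSE STRAẞE Strasse> -> $word {
        say "{$word}: {same-folded($word, 'straße')}";
    }
    ```

    The trade-off is that such a comparison cannot tell ß from ss at all, so distinct words such as Maße and Masse also compare equal.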
    

    Changes anticipated? According to a 2017 StackOverflow post, "Just wait half a century."

    https://raku.org