PowerShell Regex for German Umlaute based on upper/lowercase and position in String


I am trying to write a script in PowerShell to convert German umlauts:

ä, ö, ü, ß to ae, oe, ue, ss

Ä, Ö, Ü, ß to AE or Ae, OE or Oe, UE or Ue, and SS.

The problem is that I also need to differentiate based on the position of the umlaut.

ÜNLÜ > UENLUE
Ünlü > Uenlue (Ue)
SCHNEEWEIß > SCHNEEWEISS
Schneeweiß > Schneeweiss
Geßl > Gessl
GEßL > GESSL
Josef Öbinger > Josef Oebinger (one string)
Jürgen MÜLLER > Juergen MUELLER (one string)

The main Problem ruining my day is the Umlaut ß

There is no upper and lower case for ß

I need to identify ß based on whether the previous character was uppercase or lowercase.

I have tried various regex like [ÄÖÜßA-Z]{1,}(?![\sa-zäüö])[ÄÖÜßA-Z] or [ÄÖÜß][^a-z]

It is basically impossible for me to figure out whether to use ss or SS. Apart from that, in words like ÜNLÜ only one of the two umlauts is matched, because the other one is at the end of the word.

I need 3 matching regex patterns. One for uppercase and one for lowercase and one for mixed case (Oebinger)

Those 3 patterns will then be put inside 3 IF conditions in PowerShell, where I can then blindly convert based on the matched pattern.

[ÄÖÜß][^a-z] works for ÜNLÜ > UENLUE

[äöüß][^A-Z] works for Jürgen > Juergen

But the ß in Schneeweiß and SCHNEEWEIß is matched by both patterns. That is not what I want.

I need a pattern that can check whether the letter before and after ß is lowercase or uppercase. If lowercase, then ß = ss; if uppercase, then ß = SS.

The third case, mixed case, does not really require a separate regex. I could take the string Jürgen MÜLLER and run it through both patterns in PowerShell: the first pattern would convert it to Jürgen MUELLER, and a second run would then produce Juergen MUELLER.

The ß is always the same: its lowercase and uppercase forms are identical. This is what makes the whole thing so difficult.

I am losing hope. Please help me guys.


Solution

  • PowerShell (Core) 7+ offers a concise solution, given that the -replace operator there accepts a script block as the substitution operand, which enables flexible, dynamic substitutions based on each match found:

    $strings = @(
      'ÜNLÜ'           # > UENLUE
      'Ünlü'           # > Uenlue (Ue)
      'SCHNEEWEIß'     # > SCHNEEWEISS
      'Schneeweiß'     # > Schneeweiss
      'Geßl'           # > Gessl
      'GEßL'           # > GESSL
      'Josef Öbinger'  # > Josef Oebinger
      'Jürgen MÜLLER'  # > Juergen MUELLER
      'THEÖ HÄRSHERIN' # > THEOE HAERSHERIN
      'MÄßIG'          # > MAESSIG
    )
    
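    # Pass 1: replace each umlaut (plus the letter that follows it, if any) with base letter + E/e.
    # Pass 2: replace ß with SS/ss, depending on the case of the character before it.
    $strings `
      -replace '[äöü](?:(?=ß)|\p{L})?', {
        # $_ is the current [Match]: $_.Value[0] is the umlaut, $_.Value[1] the next letter (or $null).
        # The E/e follows the case of the next letter, or of the umlaut itself if there is none.
        ([string] $_.Value[0]).Normalize('FormD')[0] +
          ([char]::IsUpper($_.Value[1] ?? $_.Value[0]) ? 'E' : 'e') +
          $_.Value[1]
      } `
      -replace '.ß', {
        # Keep the preceding character and append SS or ss to match its case.
        $_.Value[0] + ([char]::IsUpper($_.Value[0]) ? 'SS' : 'ss')
      }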
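
To see the script-block form of -replace in isolation, here is a minimal sketch that is unrelated to umlauts. It assumes PowerShell (Core) 7+; the input string and the $result variable are made up for illustration. Inside the script block, $_ is the current match, and whatever the block outputs becomes the replacement text:

```powershell
# Upper-case the first character of every word.
# '\b\w' matches a word character that directly follows a word boundary.
$result = 'hello world' -replace '\b\w', { $_.Value.ToUpper() }

$result  # Hello World
```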
    

    Note:

    • Calling .Normalize('FormD') on a string containing a single umlaut character decomposes it into its ASCII base letter followed by a combining diaeresis (U+0308), so indexing the result with [0] yields the base letter; for instance, ü becomes u - see System.String.Normalize.
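
A small sketch of what that decomposition looks like, in case it helps. The variable names are made up for illustration, and the characters are built from their code points only so that the snippet does not depend on how the script file is encoded:

```powershell
# U+00FC is 'ü', U+00DF is 'ß'.
$umlaut = [string] [char] 0x00FC
$eszett = [string] [char] 0x00DF

# FormD (canonical decomposition) splits 'ü' into a base letter plus a combining mark.
$decomposed = $umlaut.Normalize('FormD')

$decomposed.Length                  # 2
'{0:X4}' -f [int] $decomposed[0]    # 0075 (u)
'{0:X4}' -f [int] $decomposed[1]    # 0308 (combining diaeresis)

# 'ß' has no decomposition, so normalization leaves it as a single character.
$eszett.Normalize('FormD').Length   # 1
```

This is also why ß needs its own replacement step: normalization alone does not turn it into anything else.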

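  • In Windows PowerShell (the legacy, Windows-only edition whose latest and last version is v5.1), -replace does not accept a script block as the substitution operand, so the underlying .NET API, [regex]::Replace(), must be called directly. The ternary (? :) and null-coalescing (??) operators are not available there either.

    As a result, the solution is significantly more complex: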

    $strings | ForEach-Object {
      $aux = 
        [regex]::Replace(
          $_,
          '[äöü](?:(?=ß)|\p{L})?',
          { 
            param($m) 
            ([string] $m.Value[0]).Normalize('FormD')[0] +
              $(if ([char]::IsUpper($(if ($m.Value[1]) { $m.Value[1] } else { $m.Value[0] }))) { 'E' } else { 'e' }) +
              $m.Value[1]
          },
          'IgnoreCase'
        )  
      [regex]::Replace(
        $aux,
        '.ß',
        { 
          param($m) 
          $m.Value[0] + $(if ([char]::IsUpper($m.Value[0])) { 'SS' } else { 'ss' }) 
        },
        'IgnoreCase'
      )  
    }
    

    Note: The above is the direct equivalent of the PowerShell (Core) 7+ solution, but the second [regex]::Replace() call could be replaced with the following, as also shown in js2010's answer:

    $aux -creplace '(?<=\p{Ll})ß', 'ss' -creplace '(?<=\p{Lu})ß', 'SS'
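
A minimal sketch of that lookbehind variant on its own, applied to the ß samples from the question. The variable names ($eszett, $samples, $converted) are made up for illustration, and ß is built from its code point (U+00DF) only so that the snippet does not depend on how the script file is encoded. It should run in both Windows PowerShell and PowerShell (Core) 7+:

```powershell
# U+00DF is 'ß'.
$eszett = [string] [char] 0x00DF
$samples = "Schneewei$eszett", "SCHNEEWEI$eszett", "Ge${eszett}l", "GE${eszett}L"

$converted = $samples | ForEach-Object {
  # ß after a lowercase letter becomes 'ss'; ß after an uppercase letter becomes 'SS'.
  $_ -creplace "(?<=\p{Ll})$eszett", 'ss' -creplace "(?<=\p{Lu})$eszett", 'SS'
}

$converted  # Schneeweiss, SCHNEEWEISS, Gessl, GESSL
```

The case-sensitive -creplace is used deliberately here: with the case-insensitive -replace, \p{Ll} and \p{Lu} may not reliably tell lowercase from uppercase.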