Search code examples
sqlmacrosuppercaseregexp-replacedbt

Regexp_replace collides with German umlaut ü, ö, ä


I am writing a macro in dbt with SQL to clean names. I elegantly wanted to upper the first letter of the names but my

regexp_replace('(\w)(\w*)', x -> upper(x[1]) || lower(x[2])

collides with the German umlauts ä, ö, ü

So for example the last name schöneberger with my regex expression from above becomes SchöNeberger and not Schöneberger.

Does someone know what to write so I can upper Schöneberger and other name with umlauts as well?


Solution

  • Athena uses Trino syntax, which uses Java regex syntax. Java supports the extended character classes using Unicode properties from Perl, including \p{L}, which is basically "any Unicode letter." So this will work for you:

    regexp_replace(name_col, '(\p{L})(\p{L}*)', x -> upper(x[1]) || lower(x[2]))
    

    Proof: https://regex101.com/r/N84wjS/2