I am writing a macro in dbt with SQL to clean names. I elegantly wanted to upper the first letter of the names but my
regexp_replace('(\w)(\w*)', x -> upper(x[1]) || lower(x[2])
collides with the German umlauts ä, ö, ü
So for example the last name schöneberger with my regex expression from above becomes SchöNeberger and not Schöneberger.
Does someone know what to write so I can upper Schöneberger and other name with umlauts as well?
Athena uses Trino syntax, which uses Java regex syntax. Java supports the extended character classes using Unicode properties from Perl, including \p{L}
, which is basically "any Unicode letter." So this will work for you:
regexp_replace(name_col, '(\p{L})(\p{L}*)', x -> upper(x[1]) || lower(x[2]))