java, regex, unicode, character-encoding, replaceall

Stripping characters using a regex fails using literal characters with diacritics, apostrophes, accents, and the like


I'm trying to generate a string that conforms to a pattern in an XSD. To strip any characters that don't appear in the XSD pattern, I'm doing the following (the replaceAll call literally copied from my code):

import java.lang.String;

public class HelloWorld {
    public static void main(String[] args) {
        test("Führ");
    }

    private static void test( String name ) {
        name = name.toUpperCase( );
        name = name.replaceAll (
            "[^A-ZА-ЯΑ-ΩÄÀÁÂÃÅǍĄĂÆÇĆĈČĎĐÐÈÉÊËĚĘĜĢĞĤÌÍÎÏĴĶĹĻŁĽÑŃŇÖÒÓÔÕŐØŒŔŘẞŚŜŞŠȘŤŢÞȚÜÙÚÛŰŨŲŮŴÝŸŶŹŽŻ, '\\-–]", 
            ""
        );
        System.out.println(name);
    }
}

This fragment runs fine and prints "FÜHR". However, in the environment I'm actually running in, the exact same replaceAll call removes the Ü character and prints "FHR". There, the data (i.e. the name) comes from a database and starts with the same characters as in the code snippet ("Führ").

I'm puzzled... what could be the cause, and how can I fix this?


PS: The encoding of the source file is UTF-8 (Eclipse .settings: encoding//<<<src-path>>>.java=UTF-8)


Solution

  • Apparently, when matching characters with diacritics, apostrophes, accents, and the like, one should specify the characters in the regex using the Unicode escape of the single code point rather than the literal character.

    For instance, for the à character, the regex should specify \u00E0 and not the literal à. The reason is that the à character can be represented in two ways:

    • As a single code point: the precomposed à (U+00E0)
    • As two code points: the letter a (U+0061) followed by the combining grave accent (U+0300)

    A literal à in the regex is only as reliable as the source file. The editor may have stored it in either form, and if the compiler reads the file with a different encoding than the one it was saved in, the literal turns into other characters altogether and no longer matches. The escape \u00E0 is plain ASCII, so it always denotes the single code point U+00E0, whatever the file encoding. Note that \u00E0 matches only the single-code-point form. If the input may contain the two-code-point form, normalize it to the composed form first (in Java, java.text.Normalizer with Normalizer.Form.NFC).

    Rewriting the regex using the Unicode escapes of the single code points solved the problem. For the Ü character as in the question, the regex should specify \u00DC. This matches the single-code-point Ü; a two-code-point Ü (U followed by the combining diaeresis U+0308) matches only after normalization.
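To check how the two representations of Ü behave against an escaped character class, a small sketch like the following can help. The class name CodePointDemo, the helpers allAllowed and toNfc, and the shortened whitelist are illustrative assumptions, not part of the original code.

```java
import java.text.Normalizer;

class CodePointDemo {
    // Precomposed form: one code point, U+00DC.
    static final String PRECOMPOSED = "\u00DC";
    // Decomposed form: U (U+0055) followed by the combining diaeresis (U+0308).
    static final String DECOMPOSED = "U\u0308";

    // Hypothetical helper: true if s consists only of characters
    // from a shortened whitelist (A-Z plus the precomposed U+00DC).
    static boolean allAllowed(String s) {
        return s.matches("[A-Z\\u00DC]*");
    }

    // Hypothetical helper: compose base letter + combining mark where possible.
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(PRECOMPOSED.length());               // 1
        System.out.println(DECOMPOSED.length());                // 2
        System.out.println(PRECOMPOSED.equals(DECOMPOSED));     // false
        System.out.println(allAllowed(PRECOMPOSED));            // true
        System.out.println(allAllowed(DECOMPOSED));             // false
        System.out.println(toNfc(DECOMPOSED).equals(PRECOMPOSED)); // true
    }
}
```

The escaped class accepts only the single-code-point form. The two-code-point form passes once it has been normalized to NFC.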

    I found the information that led to a solution here: Regex Tutorial - Unicode Characters and Properties (paragraph: Matching a Specific Code Point).
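For what it's worth, here is a minimal sketch of the stripping step written with escapes only, plus a normalization step so that database values pass regardless of how their accented characters are stored. The class name NameSanitizer, the method sanitize, and the shortened whitelist are assumptions for illustration. The real pattern would list every character from the XSD as an escape. Locale.ROOT is also my choice, to keep the upper-casing independent of the default locale.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

class NameSanitizer {
    // Shortened whitelist for illustration only. The escapes stand for
    // A, O and U with diaeresis (U+00C4, U+00D6, U+00DC) and the en dash (U+2013).
    private static final Pattern NOT_ALLOWED =
        Pattern.compile("[^A-Z\\u00C4\\u00D6\\u00DC, '\\-\\u2013]");

    static String sanitize(String name) {
        // Compose base letter + combining mark into a single code point first.
        String nfc = Normalizer.normalize(name, Normalizer.Form.NFC);
        String upper = nfc.toUpperCase(Locale.ROOT);
        // Remove everything that is not in the whitelist.
        return NOT_ALLOWED.matcher(upper).replaceAll("");
    }

    public static void main(String[] args) {
        String expected = "F\u00DCHR";
        // Name stored with a single-code-point u-diaeresis (U+00FC).
        System.out.println(sanitize("F\u00FChr").equals(expected));   // true
        // Name stored as u followed by the combining diaeresis (U+0308).
        System.out.println(sanitize("Fu\u0308hr").equals(expected));  // true
    }
}
```

Because the pattern contains only ASCII, it behaves the same no matter which encoding the compiler assumes for the source file.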