I'm trying to generate a string that conforms to a pattern in an XSD. To strip any characters that don't appear in the XSD pattern, I'm doing the following (the replaceAll call is copied literally from my code):
import java.lang.String;

public class HelloWorld {
    public static void main(String[] args) {
        test("Führ");
    }

    private static void test( String name ) {
        name = name.toUpperCase( );
        name = name.replaceAll (
            "[^A-ZА-ЯΑ-ΩÄÀÁÂÃÅǍĄĂÆÇĆĈČĎĐÐÈÉÊËĚĘĜĢĞĤÌÍÎÏĴĶĹĻŁĽÑŃŇÖÒÓÔÕŐØŒŔŘẞŚŜŞŠȘŤŢÞȚÜÙÚÛŰŨŲŮŴÝŸŶŹŽŻ, '\\-–]",
            ""
        );
        System.out.println(name);
    }
}
This fragment runs fine and prints "FÜHR". However, in the environment I'm running in, exactly the same replaceAll statement removes the Ü character and prints FHR when the data (i.e. the name) comes from a database and starts with the same characters as in the code snippet ("Führ").
I'm puzzled... what could be the cause, and how can I fix this?
PS: The encoding of the source file is UTF-8 (Eclipse .settings: encoding//<<<src-path>>>.java=UTF-8).
Apparently, when matching characters with diacritics (accents, umlauts, and the like), one should specify them in the regex by their Unicode code point rather than as literal characters.
For instance, for the à character, the regex should specify \u00E0 and not the literal à. The reason is that the à character can be encoded in two ways: as the single precomposed code point U+00E0, or as two code points, U+0061 (a) followed by U+0300 (combining grave accent).
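To make the two encodings concrete, here is a minimal sketch that prints the code points of both forms and converts one into the other with java.text.Normalizer. The class name TwoEncodingsDemo and the helper codePoints are made up for this illustration; the strings are written with escapes so the result does not depend on the source file's encoding.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

// Hypothetical demo class, not part of the original code.
class TwoEncodingsDemo {

    // Render a string as space-separated U+XXXX code points.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E0";  // à as a single code point
        String decomposed = "a\u0300";  // a followed by a combining grave accent

        System.out.println(codePoints(precomposed));        // U+00E0
        System.out.println(codePoints(decomposed));         // U+0061 U+0300
        System.out.println(precomposed.equals(decomposed)); // false

        // NFC composes the pair into the single precomposed code point.
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.equals(precomposed));   // true
    }
}
```

Running the same helper on the value that comes from the database should show which code points it really contains, which is probably the quickest way to see why a character falls outside the character class.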
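Specifying the Unicode escape \u00E0 in the regex always denotes the single code point U+00E0, regardless of the encoding the compiler assumes for the source file. Specifying the literal à in the regex will only match the way that character ended up encoded in your compiled code: if it was stored as two code points, or was misread because the build used a different source encoding than the editor, it will not match the single-code-point version of the same character. Note that in Java neither form matches the two-code-point version by itself; for that, the input has to be normalized first (java.text.Normalizer with form NFC), or the pattern has to be compiled with Pattern.CANON_EQ.
Rewriting the regex using the Unicode single code points solved the problem. For the Ü character as in the question, the regex should specify \u00DC, which matches the single-code-point Ü (U+00DC) no matter how the source file was compiled. The fact that the output was FHR rather than FUHR suggests the database value did contain the single code point, and that the literal Ü in the compiled regex was what differed.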
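As a sketch of how the fix can be made robust against both problems (source encoding and decomposed input), the name can be normalized to NFC before stripping, with the character class written entirely as escapes. This is an illustration under assumptions, not the original code: the class name SanitizeName, the method sanitize, and the constant NOT_ALLOWED are hypothetical, the character class is shortened to a handful of characters, and Locale.ROOT is used for upper-casing where the original relies on the default locale.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class, not part of the original code.
class SanitizeName {

    // Abbreviated character class for illustration only:
    // A-Z, Ä (U+00C4), Ö (U+00D6), Ü (U+00DC), comma, space, apostrophe, hyphen.
    // The doubled backslashes leave the \\uXXXX escapes to the regex engine.
    private static final String NOT_ALLOWED = "[^A-Z\\u00C4\\u00D6\\u00DC, '\\-]";

    static String sanitize(String name) {
        String upper = name.toUpperCase(Locale.ROOT);
        // Compose base letter + combining mark into a single code point where possible.
        String nfc = Normalizer.normalize(upper, Normalizer.Form.NFC);
        return nfc.replaceAll(NOT_ALLOWED, "");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("F\u00FChr"));  // input with precomposed ü
        System.out.println(sanitize("Fu\u0308hr")); // input with u + combining diaeresis
    }
}
```

Both calls should print FÜHR. The full character class from the question would need every non-ASCII character replaced by its escape in the same way.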
I found the information that led to a solution here: Regex Tutorial - Unicode Characters and Properties (paragraph: Matching a Specific Code Point).