java regex unicode posix

Relationship between Alnum and IsAlphabetic character classes in Java RegEx patterns


Looking at the Javadoc for java.util.regex.Pattern

\p{Alnum}    An alphanumeric character: [\p{IsAlphabetic}\p{IsDigit}]

it appears that every character that matches \p{IsAlphabetic} should also match \p{Alnum}.

However, this does not seem to be the case for accented characters. For example, the following assertion fails:

assertEquals("é".matches("\\p{IsAlphabetic}+"),"é".matches("\\p{Alnum}+"));

The same thing happens for other accented characters such as ą, ó, ł, ź and ż. All of them match \p{IsAlphabetic}+ but not \p{Alnum}+.

Am I misinterpreting the Javadoc? Or is this a bug in the documentation or the implementation?


Solution

  • By default, \p{Alnum} is treated as a POSIX character class, which means it only ever matches ASCII characters. It will match a and 1, but not ä or ١ (the Arabic-Indic digit one).

    The passage you quote only applies when the UNICODE_CHARACTER_CLASS flag is used.

    Slightly oversimplified, this flag turns the "old" POSIX-style character classes into their equivalent Unicode character classes.
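    To see the difference, here is a minimal sketch that runs the same \p{Alnum}+ pattern with and without the flag. The class name AlnumDemo and its helper methods are made up for illustration; Pattern.UNICODE_CHARACTER_CLASS and the embedded (?U) flag are part of java.util.regex. The non-ASCII characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class AlnumDemo {

    // Default mode: \p{Alnum} is the POSIX class and matches ASCII only.
    static boolean matchesAsciiAlnum(String s) {
        return Pattern.compile("\\p{Alnum}+").matcher(s).matches();
    }

    // With UNICODE_CHARACTER_CLASS, \p{Alnum} behaves like
    // [\p{IsAlphabetic}\p{IsDigit}], as described in the Javadoc.
    static boolean matchesUnicodeAlnum(String s) {
        return Pattern.compile("\\p{Alnum}+", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s)
                      .matches();
    }

    // The embedded flag (?U) enables the same mode inside the pattern,
    // which is handy with String.matches.
    static boolean matchesWithEmbeddedFlag(String s) {
        return s.matches("(?U)\\p{Alnum}+");
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";      // é
        String arabicOne = "\u0661";   // Arabic-Indic digit one

        System.out.println(matchesAsciiAlnum("a1"));
        System.out.println(matchesAsciiAlnum(eAcute));
        System.out.println(matchesUnicodeAlnum(eAcute));
        System.out.println(matchesUnicodeAlnum(arabicOne));
        System.out.println(matchesWithEmbeddedFlag(eAcute));
    }
}
```

    With the flag set (either way), the assertion from the question should hold, because \p{Alnum} then covers everything that \p{IsAlphabetic} matches.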