java regex scikit-learn pattern-matching countvectorizer

Java regex doesn't match outside of the ASCII range, behaves differently than Python regex

I want to filter strings from documents the same way sklearn's CountVectorizer does. It uses the following regex: (?u)\b\w\w+\b. I expected this Java code to behave the same way:

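import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern regex = Pattern.compile("(?u)\\b\\w\\w+\\b");
Matcher matcher = regex.matcher("this is the document.!? äöa m²");

// Print every token the pattern finds
while (matcher.find()) {
    String match = matcher.group();
    System.out.println(match);
}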

But this doesn't produce the desired output, which is what Python gives:

this
is
the
document
äöa
m²

It instead outputs:

this
is
the
document

What can I do to include non-ASCII characters, as the Python regex does?

Solution

In Java, the lowercase (?u) flag only enables UNICODE_CASE, so \w stays ASCII-only. As suggested by Wiktor in the comments, you could use (?U) to turn on the UNICODE_CHARACTER_CLASS flag instead. While this does allow matching äöa, it still doesn't match m². That's because \w under UNICODE_CHARACTER_CLASS doesn't treat ² as a word character.

As a replacement for \w, you can use [\pN\pL_]. This matches Unicode numbers \pN and Unicode letters \pL (plus _). The \pN class includes the \p{No} (Other_Number) category, which contains the superscripts ², ³ and ¹ from the Latin-1 Supplement block. Alternatively, you could just add \p{No} to a character class alongside \w. Note that Java needs the braces for two-letter property names: \pNo without braces is read as \pN followed by a literal o. This means the following regular expressions correctly match your strings:

[\pN\pL_]{2,}         # Matches any Unicode number or letter, and underscore
(?U)[\w\p{No}]{2,}    # Uses UNICODE_CHARACTER_CLASS so that \w matches Unicode.
                      # Adds \p{No} to additionally match ²³¹ (braces are required;
                      # \pNo would be parsed as \pN plus a literal o)
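To see the difference side by side, here is a minimal sketch that runs the question's pattern and the two replacements over the sample document. The class name UnicodeTokenDemo and the helper tokens are made up for this illustration, and the non-ASCII characters are written as Unicode escapes so the source file's encoding doesn't matter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeTokenDemo {
    // Hypothetical helper: collects every match of regex in input, in order.
    static List<String> tokens(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Same sample text as in the question, with a-umlaut, o-umlaut
        // and superscript two written as escapes.
        String doc = "this is the document.!? \u00e4\u00f6a m\u00b2";

        // Pattern from the question: \w is ASCII-only here.
        System.out.println(tokens("(?u)\\b\\w\\w+\\b", doc));

        // Explicit Unicode letter and number categories.
        System.out.println(tokens("[\\pN\\pL_]{2,}", doc));

        // Unicode-aware \w plus the Other_Number category.
        System.out.println(tokens("(?U)[\\w\\p{No}]{2,}", doc));
    }
}
```

The first call should print only the four ASCII words, while the other two should also include äöa and m².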

So why doesn't \w match ² in Java when it does in Python?

Java's interpretation

Looking at OpenJDK 8-b132's Pattern implementation, we get the following information (I removed information irrelevant to answering the question):

Unicode support

The following Predefined Character classes and POSIX character classes are in conformance with the recommendation of Annex C: Compatibility Properties of Unicode Regular Expression, when UNICODE_CHARACTER_CLASS flag is specified.

\w A word character: [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}]

Great! Now we have a definition for \w when the (?U) flag is used. Plugging these Unicode character classes into a Unicode property lookup tool will tell you exactly what each of them matches. Without making this post super long, I'll just go ahead and tell you that none of the following classes matches ²:

\p{Alpha}
\p{gc=Mn}
\p{gc=Me}
\p{gc=Mc}
\p{Digit}
\p{gc=Pc}
\p{IsJoin_Control}
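If you'd rather check this from Java than from a lookup tool, the following sketch asks the JDK directly. The class name SuperscriptTwoCheck and the helper matches are invented for this example; Character.getType, Character.isDigit and Character.isAlphabetic are standard java.lang.Character methods.

```java
import java.util.regex.Pattern;

class SuperscriptTwoCheck {
    // Hypothetical helper: true when regex matches the whole string s.
    static boolean matches(String regex, String s) {
        return Pattern.compile(regex).matcher(s).matches();
    }

    public static void main(String[] args) {
        String sup2 = "\u00b2"; // SUPERSCRIPT TWO
        int cp = sup2.codePointAt(0);

        // General category is No (Other_Number), not Nd.
        System.out.println(Character.getType(cp) == Character.OTHER_NUMBER); // true

        // Neither a decimal digit nor alphabetic, which rules out
        // \p{Digit} and \p{Alpha} under UNICODE_CHARACTER_CLASS.
        System.out.println(Character.isDigit(cp));      // false
        System.out.println(Character.isAlphabetic(cp)); // false

        // So Unicode-aware \w rejects it, while \p{No} accepts it.
        System.out.println(matches("(?U)\\w", sup2));   // false
        System.out.println(matches("\\p{No}", sup2));   // true
    }
}
```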

Python's interpretation

So why does Python match ²³¹ when the u flag is used in conjunction with \w? This one was very difficult to track down, but I went digging into Python's source code (I used Python 3.6.5rc1 - 2018-03-13). After removing a lot of the fluff for how this gets called, basically the following happens:

\w is defined as CATEGORY_UNI_WORD, which is then prefixed with SRE_. SRE_CATEGORY_UNI_WORD calls SRE_UNI_IS_WORD(ch)
SRE_UNI_IS_WORD is defined as (SRE_UNI_IS_ALNUM(ch) || (ch) == '_').
SRE_UNI_IS_ALNUM calls Py_UNICODE_ISALNUM, which is, in turn, defined as (Py_UNICODE_ISALPHA(ch) || Py_UNICODE_ISDECIMAL(ch) || Py_UNICODE_ISDIGIT(ch) || Py_UNICODE_ISNUMERIC(ch)).
The numeric checks are the interesting part. Each is a macro over a lookup function; Py_UNICODE_ISDECIMAL(ch), for example, is defined as _PyUnicode_IsDecimalDigit(ch).

Now, let's take a look at the method _PyUnicode_IsDecimalDigit(ch):

int _PyUnicode_IsDecimalDigit(Py_UCS4 ch)
{
    if (_PyUnicode_ToDecimalDigit(ch) < 0)
        return 0;
    return 1;
}

As we can see, this method returns 0 if _PyUnicode_ToDecimalDigit(ch) < 0 and 1 otherwise. So what does _PyUnicode_ToDecimalDigit look like?

int _PyUnicode_ToDecimalDigit(Py_UCS4 ch)
{
    const _PyUnicode_TypeRecord *ctype = gettyperecord(ch);

    return (ctype->flags & DECIMAL_MASK) ? ctype->decimal : -1;
}

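Great, so basically, the function looks up the type record for the code point and, if that record's flags include DECIMAL_MASK, returns the character's decimal value (0 or greater); otherwise it returns -1.

Note that the mask is applied to the flags stored in the type record, not to the code point itself. ² (U+00B2) has the general category No rather than Nd, so its record does not carry DECIMAL_MASK and Py_UNICODE_ISDECIMAL is false for it. It does, however, have the digit value 2 and a numeric value, so the analogous Py_UNICODE_ISDIGIT and Py_UNICODE_ISNUMERIC checks succeed (in Python, '²'.isdecimal() is False, while '²'.isdigit() and '²'.isnumeric() are True). That is enough for Py_UNICODE_ISALNUM, so ² is deemed a valid Unicode alphanumeric character in Python, and \w with the u flag matches it.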
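Putting the Python chain together: a character counts as a word character if it is a letter, passes one of the numeric checks, or is an underscore. Here is a rough Java approximation of that rule, not a port of CPython's tables. The class and method names are made up, and it assumes that the general categories Nd, Nl and No are a good enough stand-in for Python's decimal, digit and numeric checks (Python also treats some letters with a numeric value as numeric, but those already pass the letter test).

```java
class PythonWordSketch {
    // Hypothetical helper approximating Python's SRE_UNI_IS_WORD.
    // Assumption: "numeric" is approximated by categories Nd, Nl and No.
    static boolean isPythonWord(int cp) {
        if (cp == '_') {
            return true;
        }
        // Roughly Py_UNICODE_ISALPHA: any letter category (Lu, Ll, Lt, Lm, Lo).
        if (Character.isLetter(cp)) {
            return true;
        }
        // Roughly ISDECIMAL || ISDIGIT || ISNUMERIC via general categories.
        switch (Character.getType(cp)) {
            case Character.DECIMAL_DIGIT_NUMBER: // Nd
            case Character.LETTER_NUMBER:        // Nl
            case Character.OTHER_NUMBER:         // No, includes superscript two
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPythonWord(0x00B2)); // superscript two
        System.out.println(isPythonWord('.'));    // punctuation
    }
}
```

This is the same set of characters that [\pN\pL_] matches, which is why that character class reproduces Python's tokens in Java.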