I want to filter Strings from documents the same way sklearn's CountVectorizer does. It uses the following RegEx: (?u)\b\w\w+\b
This Java code should behave the same way:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern regex = Pattern.compile("(?u)\\b\\w\\w+\\b");
Matcher matcher = regex.matcher("this is the document.!? äöa m²");
while (matcher.find()) {
    String match = matcher.group();
    System.out.println(match);
}
But this doesn't produce the desired output. In Python, the same regex produces:
this
is
the
document
äöa
m²
It instead outputs:
this
is
the
document
What can I do to include non-ASCII characters, as the Python regex does?
As suggested by Wiktor in the comments, you could use (?U)
to turn on the flag UNICODE_CHARACTER_CLASS
. While this does allow matching äöa
, this still doesn't match m²
. That's because UNICODE_CHARACTER_CLASS
with \w
doesn't recognize ²
as a valid alphanumeric character. As a replacement for \w
, you can use [\pN\pL_]
. This matches Unicode numbers \pN
and Unicode letters \pL
(plus _
). The \pN Unicode character class includes \p{No} (Other_Number), which covers the superscript digits ²³¹ from the Latin-1 Supplement block. Alternatively, you could just add \p{No} to a character class with \w. Note that Java needs the braces for a two-letter property name: \pNo is parsed as \pN followed by a literal o. This means the following regular expressions correctly match your strings:
[\pN\pL_]{2,} # Matches any Unicode number or letter, and underscore
(?U)[\w\p{No}]{2,} # Uses UNICODE_CHARACTER_CLASS so that \w matches Unicode.
# Adds \p{No} to additionally match ²³¹
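To make the two patterns above concrete, here is a minimal sketch that runs them against the sample document from the question. The class name TokenDemo and the helper tokens are made up for illustration, and the second pattern spells the property as \p{No} with braces, which is how Java expects two-letter property names to be written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of any library.
final class TokenDemo {
    // Collects every match of the given regex in the input, in order.
    static List<String> tokens(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Non-ASCII characters are written as escapes so the result does not
        // depend on the source file encoding (U+00E4, U+00F6 and U+00B2).
        String doc = "this is the document.!? \u00e4\u00f6a m\u00b2";

        // Unicode numbers, Unicode letters and underscore.
        System.out.println(tokens("[\\pN\\pL_]{2,}", doc));

        // UNICODE_CHARACTER_CLASS plus the Other_Number category.
        System.out.println(tokens("(?U)[\\w\\p{No}]{2,}", doc));
    }
}
```

Both calls should list the same six tokens as the Python output in the question, including the last two that the original pattern misses.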
So why does \w match ² in Python but not in Java?
Looking at OpenJDK 8-b132's Pattern
implementation, we get the following information (I removed information irrelevant to answering the question):
Unicode support
The following Predefined Character classes and POSIX character classes are in conformance with the recommendation of Annex C: Compatibility Properties of Unicode Regular Expression, when
UNICODE_CHARACTER_CLASS
flag is specified.
\w
A word character:[\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}]
Great! Now we have a definition for \w
when the (?U)
flag is used. Plugging these Unicode character classes into this amazing tool will tell you exactly what each of them matches. Without making this post super long, I'll just go ahead and tell you that none of the following classes matches ²:
\p{Alpha}
\p{gc=Mn}
\p{gc=Me}
\p{gc=Mc}
\p{Digit}
\p{gc=Pc}
\p{IsJoin_Control}
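If you would rather check this in Java than in an online tool, a small sketch like the following should do. The class and method names are made up for illustration, and each class is compiled with (?U) so that \p{Alpha} and \p{Digit} get their Unicode meaning rather than the ASCII one.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of any library.
final class SuperscriptCheck {
    // SUPERSCRIPT TWO, written as an escape to avoid source-encoding issues.
    static final String SUPERSCRIPT_TWO = "\u00b2";

    // The building blocks of \w under UNICODE_CHARACTER_CLASS, as quoted above.
    static final String[] WORD_CLASSES = {
        "\\p{Alpha}", "\\p{gc=Mn}", "\\p{gc=Me}", "\\p{gc=Mc}",
        "\\p{Digit}", "\\p{gc=Pc}", "\\p{IsJoin_Control}"
    };

    // True if at least one of the classes above matches the superscript two.
    static boolean matchesAnyWordClass() {
        for (String cls : WORD_CLASSES) {
            if (Pattern.compile("(?U)" + cls).matcher(SUPERSCRIPT_TWO).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("any \\w building block matches: " + matchesAnyWordClass());
        System.out.println("(?U)\\w matches: " + SUPERSCRIPT_TWO.matches("(?U)\\w"));
        // The general category of the character is No (Other_Number).
        System.out.println("category is No: "
                + (Character.getType('\u00b2') == Character.OTHER_NUMBER));
        System.out.println("\\p{No} matches: " + SUPERSCRIPT_TWO.matches("\\p{No}"));
    }
}
```

The first two lines should report false and the last two true, which is why adding the Other_Number category to the character class is enough to pick up ².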
So why does Python match ²³¹
when the u
flag is used in conjunction with \w
? This one was very difficult to track down, but I went digging into Python's source code (I used Python 3.6.5rc1 - 2018-03-13). After removing a lot of the fluff for how this gets called, basically the following happens:
\w is defined as CATEGORY_UNI_WORD, which is then prefixed with SRE_. SRE_CATEGORY_UNI_WORD calls SRE_UNI_IS_WORD(ch).
SRE_UNI_IS_WORD is defined as (SRE_UNI_IS_ALNUM(ch) || (ch) == '_').
SRE_UNI_IS_ALNUM calls Py_UNICODE_ISALNUM, which is, in turn, defined as (Py_UNICODE_ISALPHA(ch) || Py_UNICODE_ISDECIMAL(ch) || Py_UNICODE_ISDIGIT(ch) || Py_UNICODE_ISNUMERIC(ch)).
Py_UNICODE_ISDECIMAL(ch) is defined as _PyUnicode_IsDecimalDigit(ch).
Now, let's take a look at the function _PyUnicode_IsDecimalDigit(ch):
int _PyUnicode_IsDecimalDigit(Py_UCS4 ch)
{
    if (_PyUnicode_ToDecimalDigit(ch) < 0)
        return 0;
    return 1;
}
As we can see, this function returns 0 if _PyUnicode_ToDecimalDigit(ch) < 0 and 1 otherwise. So what does _PyUnicode_ToDecimalDigit look like?
int _PyUnicode_ToDecimalDigit(Py_UCS4 ch)
{
    const _PyUnicode_TypeRecord *ctype = gettyperecord(ch);
    return (ctype->flags & DECIMAL_MASK) ? ctype->decimal : -1;
}
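Great, so basically, if the character's type record has the DECIMAL_MASK flag (0x02) set, a value greater than or equal to 0 is returned and the decimal check succeeds.
Note that flags comes from the type record that gettyperecord(ch) looks up in Python's Unicode database, not from the code point itself, so the test is not 0x000000b2 & 0x02. In fact ² (U+00B2) is not a decimal digit: '²'.isdecimal() is False in Python. It does have the Unicode numeric type Digit, though, so '²'.isdigit() and '²'.isnumeric() are both True, which means the Py_UNICODE_ISDIGIT(ch) and Py_UNICODE_ISNUMERIC(ch) branches of Py_UNICODE_ISALNUM succeed.
² is therefore deemed to be a valid Unicode alphanumeric character in Python, and \w with the u flag matches ².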
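For comparison, here is a rough Java stand-in for the Python chain traced above (alphanumeric or underscore). It is only an approximation: it assumes that Python's alpha, decimal, digit and numeric checks can be replaced by the general categories L*, Nd, Nl and No, which is close but not guaranteed to agree for every code point. The class and method names are made up for illustration.

```java
// Hypothetical helper; an approximation of Python's \w, not a port of it.
final class PythonLikeWord {
    // Approximates Py_UNICODE_ISALNUM(ch) || ch == '_' using general categories.
    static boolean isWord(int codePoint) {
        if (codePoint == '_' || Character.isLetter(codePoint)) {
            return true;
        }
        switch (Character.getType(codePoint)) {
            case Character.DECIMAL_DIGIT_NUMBER: // Nd, e.g. 0-9
            case Character.LETTER_NUMBER:        // Nl, e.g. Roman numeral signs
            case Character.OTHER_NUMBER:         // No, e.g. superscript digits
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWord(0x00B2)); // SUPERSCRIPT TWO
        System.out.println(isWord('m'));
        System.out.println(isWord('.'));
    }
}
```

Under that assumption the helper accepts the same characters as the [\pN\pL_] class suggested earlier, which is the simpler option when a regex is all you need.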