Regular expression that accepts tokens of three or more alphabetical characters

I'm trying to build a TfidfVectorizer that only accepts tokens of 3 or more alphabetical characters, using TfidfVectorizer(token_pattern="(?u)\\b\\D\\D\\D+\\b")

But it doesn't behave correctly. I know token_pattern="(?u)\\b\\w\\w\\w+\\b" accepts tokens of 3 or more alphanumerical characters, so I don't understand why the former is not working.

What am I missing?

Solution

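The problem lies in the \D metacharacter: it matches any non-digit character, including spaces and punctuation, rather than only alphabetical characters. The Python docs describe \D as matching any character which is not a decimal digit, so your pattern can match across several words at once.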

You can go instead with:

token_pattern="(?i)[a-z]{3,}"

Explanation:

(?i): inline flag that makes matching case-insensitive,
[a-z]: matches any basic Latin letter, a to z (accented letters such as é are not matched),
{3,}: makes the previous token match three or more times (greedily, i.e., as many times as possible).
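To see the difference concretely, here is a minimal sketch that runs the patterns through Python's re module directly. It assumes the vectorizer extracts tokens by calling findall with the compiled token_pattern, which is my understanding of how scikit-learn applies it. The sample text and the variable names are made up for illustration, and the third pattern, with \b word boundaries, is an optional variant that is not part of the original suggestion.

```python
import re

# Hypothetical sample text, chosen to include punctuation and digits.
text = "hi there, abc123 ok"

# The pattern from the question: \D is "any non-digit", so it also
# consumes spaces and punctuation.
digit_free = re.compile(r"(?u)\b\D\D\D+\b")

# The suggested pattern: runs of three or more letters a-z, any case.
letters_only = re.compile(r"(?i)[a-z]{3,}")

# Optional variant (an assumption about what you may want): adding \b
# rejects letter runs that are glued to digits, such as "abc" in "abc123".
bounded = re.compile(r"(?i)\b[a-z]{3,}\b")

print(digit_free.findall(text))    # ['hi there, ', ' ok']
print(letters_only.findall(text))  # ['there', 'abc']
print(bounded.findall(text))       # ['there']
```

Note that without the word boundaries, the suggested pattern still picks "abc" out of "abc123". If mixed tokens like that should be dropped entirely, the bounded variant may be closer to what you want.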

I hope this answers your question. :)