python regex tfidfvectorizer

Regular expression that accepts tokens of three or more alphabetical characters


I'm trying to build a TfidfVectorizer that only accepts tokens of three or more alphabetical characters, using TfidfVectorizer(token_pattern="(?u)\\b\\D\\D\\D+\\b").

But it doesn't behave as expected. I know that token_pattern="(?u)\\b\\w\\w\\w+\\b" accepts tokens of three or more alphanumeric characters, so I don't understand why the former doesn't work.

What am I missing?


Solution

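  • The problem lies in the \D metacharacter: it matches any non-digit character, not just alphabetical ones. The Python re documentation describes \D as matching any character that is not a decimal digit (the opposite of \d). Spaces and punctuation are non-digits too, so the pattern can match across several words at once.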
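To see the difference concretely, here is a minimal sketch that applies both patterns from the question with re.findall. It assumes the vectorizer uses token_pattern in a findall-like way; the sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen only to illustrate the behaviour.
sample = "hi, foo bar"

# Pattern from the question: \D is "any non-digit", so spaces and
# punctuation are consumed along with the letters.
non_digit_pattern = re.compile(r"(?u)\b\D\D\D+\b")
non_digit_tokens = non_digit_pattern.findall(sample)

# The \w-based pattern from the question, for comparison.
word_pattern = re.compile(r"(?u)\b\w\w\w+\b")
word_tokens = word_pattern.findall(sample)

print(non_digit_tokens)  # ['hi, foo bar']
print(word_tokens)       # ['foo', 'bar']
```

The \D version returns the whole string as a single "token", including the two-letter word and the comma, because nothing in it is a digit.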


    You can use this instead:
    token_pattern="(?i)[a-z]{3,}"
    

    Explanation:

    • (?i): inline flag that makes matching case-insensitive,
    • [a-z]: matches any ASCII Latin letter (accented letters such as é are not matched),
    • {3,}: makes the previous token match three or more times (greedily, i.e., as many times as possible).
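A quick way to check the suggested pattern is to run it with re.findall on a sample string, as in the sketch below. The sample text is hypothetical. The second pattern, with \b word boundaries added, is not part of the suggestion above; it is a variant you might prefer if letter runs inside mixed tokens such as abc123 should be rejected rather than partially matched.

```python
import re

# Hypothetical sample text containing a mixed letter/digit token.
text = "The cat saw abc123 and it ran"

# Pattern suggested above: any run of three or more ASCII letters.
answer_pattern = re.compile(r"(?i)[a-z]{3,}")
answer_tokens = answer_pattern.findall(text)

# Variant (an assumption, not from the answer): require word boundaries
# on both sides, so "abc" inside "abc123" is not accepted.
bounded_pattern = re.compile(r"(?i)\b[a-z]{3,}\b")
bounded_tokens = bounded_pattern.findall(text)

print(answer_tokens)   # ['The', 'cat', 'saw', 'abc', 'and', 'ran']
print(bounded_tokens)  # ['The', 'cat', 'saw', 'and', 'ran']
```

Either string can be passed as token_pattern. Since TfidfVectorizer lowercases text by default (lowercase=True), the (?i) flag mainly matters if you turn that option off.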

    I hope this answers your question. :)