python machine-learning countvectorizer

CountVectorizer for numbers


I have a list of numbers and I want to use CountVectorizer on it:

from sklearn.feature_extraction.text import CountVectorizer

def x(n):
    return str(n)

sentences = [5, 10, 15, 10, 5, 10]

vectorizer = CountVectorizer(preprocessor=x, analyzer="word")
vectorizer.fit(sentences)

vectorizer.vocabulary_

output:

{'10': 0, '15': 1}

and:

vectorizer.transform(sentences).toarray()

output:

array([[0, 0],
       [1, 0],
       [0, 1],
       [1, 0],
       [0, 0],
       [1, 0]], dtype=int64)

But why doesn't this work for numbers less than 10? The 5 is missing from the vocabulary.


Solution

  • This is the expected behavior. The documentation for the token_pattern parameter of CountVectorizer says:

    token_pattern : str or None, default=r"(?u)\b\w\w+\b"
    

    Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'.

    The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

    If there is a capturing group in token_pattern then the captured group content, not the entire match, becomes the token. At most one capturing group is permitted.

    If you want single-character strings to be treated as tokens too, delete the first \w from the regex. The pattern then matches tokens of one or more characters, whereas the default requires at least two, as the documentation states.

    vectorizer = CountVectorizer(preprocessor=x, analyzer="word", token_pattern=r"(?u)\b\w+\b")
    
    
    Output of vectorizer.vocabulary_ after fitting on the same list:
    {'5': 2, '10': 0, '15': 1}
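
The difference between the two patterns can also be checked without scikit-learn, using only Python's re module on the stringified inputs from the question. This is a minimal sketch that only mimics the tokenization step; the names DEFAULT_PATTERN, RELAXED_PATTERN, documents and tokens are made up for illustration and are not part of scikit-learn.

```python
import re

# Default token_pattern, as given in the CountVectorizer documentation.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern from the answer: one or more word characters.
RELAXED_PATTERN = r"(?u)\b\w+\b"

# Same inputs as in the question, converted to strings as the preprocessor does.
documents = [str(n) for n in [5, 10, 15, 10, 5, 10]]


def tokens(pattern, docs):
    """Return the sorted set of tokens that `pattern` finds across `docs`."""
    found = set()
    for doc in docs:
        found.update(re.findall(pattern, doc))
    return sorted(found)


default_tokens = tokens(DEFAULT_PATTERN, documents)
relaxed_tokens = tokens(RELAXED_PATTERN, documents)

print(default_tokens)  # ['10', '15']
print(relaxed_tokens)  # ['10', '15', '5']
```

With the default pattern, "5" never matches because it is only one character long, so it never reaches the vocabulary. The tokens are sorted as strings here, which is why '5' comes last, in the same order as the indices in the vocabulary shown above.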