I have a list of numbers and I want to use CountVectorizer on them:
from sklearn.feature_extraction.text import CountVectorizer

def x(n):
    return str(n)

sentences = [5, 10, 15, 10, 5, 10]
vectorizer = CountVectorizer(preprocessor=x, analyzer="word")
vectorizer.fit(sentences)
vectorizer.vocabulary_
output:
{'10': 0, '15': 1}
and:
vectorizer.transform(sentences).toarray()
output:
array([[0, 0],
       [1, 0],
       [0, 1],
       [1, 0],
       [0, 0],
       [1, 0]], dtype=int64)
But why doesn't this work for numbers less than 10? The single-digit 5 never shows up in the vocabulary.
This is the expected behavior. The documentation for the token_pattern parameter of CountVectorizer says:
token_pattern : str or None, default=r"(?u)\b\w\w+\b"
Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
If there is a capturing group in token_pattern then the captured group content, not the entire match, becomes the token. At most one capturing group is permitted.
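To see why the single-digit 5 is dropped, you can test the default pattern directly with Python's re module (a quick illustrative check, separate from your pipeline):

import re

default_pattern = r"(?u)\b\w\w+\b"  # CountVectorizer's default token_pattern
for doc in ["5", "10", "15"]:
    print(doc, re.findall(default_pattern, doc))
# 5 []         <- a single character never matches \w\w+
# 10 ['10']
# 15 ['15']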
If you want single-character tokens to be considered too, just delete the first \w in the regex. The pattern \b\w+\b then matches tokens of 1 or more word characters, whereas the default requires 2 or more, as described in the documentation.
vectorizer = CountVectorizer(preprocessor=x, analyzer="word", token_pattern=r"(?u)\b\w+\b")
Output of vectorizer.vocabulary_ after fitting again:
{'5': 2, '10': 0, '15': 1}
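For completeness, here is a small end-to-end sketch with the relaxed pattern (using the built-in str in place of your x helper, which does the same conversion); with your sentences, every number should now get its own column:

from sklearn.feature_extraction.text import CountVectorizer

sentences = [5, 10, 15, 10, 5, 10]
vectorizer = CountVectorizer(
    preprocessor=str,              # same effect as your x(n) helper
    analyzer="word",
    token_pattern=r"(?u)\b\w+\b",  # 1 or more word characters
)
X = vectorizer.fit_transform(sentences)
print(vectorizer.vocabulary_)  # {'5': 2, '10': 0, '15': 1}
print(X.toarray())
# [[0 0 1]
#  [1 0 0]
#  [0 1 0]
#  [1 0 0]
#  [0 0 1]
#  [1 0 0]]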