I am using sklearn's CountVectorizer to convert my strings into vectors. However, by default CountVectorizer only selects tokens of 2 or more characters, and it also ignores punctuation, treating it as a separator. I want to count even a single character as a token and also keep punctuation. For example:
aaa 1 2.75 zzz
aaa 2 3.75 www
I want a count matrix like this:
1 1 1 0 1 1 0
1 0 1 1 0 0 1
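For reference, here is a minimal sketch of what the defaults give me (assuming scikit-learn's standard settings; the exact indices may vary):
from sklearn.feature_extraction.text import CountVectorizer

docs = ["aaa 1 2.75 zzz", "aaa 2 3.75 www"]

# The default token_pattern r"(?u)\b\w\w+\b" drops single-character
# tokens ('1', '2') and splits '2.75' / '3.75' at the period, so only
# '75' survives from the numbers.
cv = CountVectorizer()
cv.fit_transform(docs)
print(cv.vocabulary_)  # e.g. {'75': 0, 'aaa': 1, 'www': 2, 'zzz': 3}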
Is there a simple way to achieve this goal?
You can use a custom tokenizer as in this example:
import re
from sklearn.feature_extraction.text import CountVectorizer

new_docs = ["aaa 1 2.75 zzz", "aaa 2 3.75 www"]

def my_tokenizer(text):
    # split on whitespace only, so single-character tokens and punctuation are kept
    return re.split(r"\s+", text)

cv = CountVectorizer(tokenizer=my_tokenizer)
count_vector = cv.fit_transform(new_docs)
print(cv.vocabulary_)
Example output:
{'aaa': 4, '1': 0, '2.75': 2, 'zzz': 6, '2': 1, '3.75': 3, 'www': 5}
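Since fit_transform returns a sparse matrix, you can call toarray() to inspect the counts; the columns follow the indices in vocabulary_ above:
print(count_vector.toarray())
# [[1 0 1 0 1 0 1]
#  [0 1 0 1 1 1 0]]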
See more CountVectorizer usage examples here.