How to treat punctuation as part of tokens in CountVectorizer?


I am using scikit-learn's CountVectorizer to convert my strings into vectors. By default, however, CountVectorizer only selects tokens of two or more characters, and it ignores punctuation, treating it as a separator. I want even a single character to count as a token, and I want punctuation to be included. For example:

aaa 1 2.75 zzz
aaa 2 3.75 www

I want a matrix of

1 1 1 0 1 1 0 
1 0 1 1 0 0 1

Is there a simple way to achieve this goal?


Solution

  • You can pass a custom tokenizer that splits only on whitespace, as in this example:

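    import re

    from sklearn.feature_extraction.text import CountVectorizer

    new_docs = ["aaa 1 2.75 zzz", "aaa 2 3.75 www"]

    def my_tokenizer(text):
        # Split on whitespace only, so punctuation and one-character tokens are kept.
        return re.split(r"\s+", text)

    # Pass the tokenizer by keyword; the documents go to fit_transform, not the constructor.
    cv = CountVectorizer(tokenizer=my_tokenizer)
    count_vector = cv.fit_transform(new_docs)
    print(cv.vocabulary_)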
    

    Example output:

    {'aaa': 4, '1': 0, '2.75': 2, 'zzz': 6, '2': 1, '3.75': 3, 'www': 5}
    

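As a possible alternative to a custom tokenizer, the same whitespace-only splitting can probably be had through CountVectorizer's `token_pattern` parameter. The sketch below assumes the pattern `r"\S+"` (any run of non-whitespace characters) suits your data; the names `docs`, `counts`, and `columns` are illustrative only. It also prints the count matrix so the column order can be checked against the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["aaa 1 2.75 zzz", "aaa 2 3.75 www"]

# Assumption: every run of non-whitespace characters is one token,
# so one-character tokens and punctuation such as "." are kept.
cv = CountVectorizer(token_pattern=r"\S+")
counts = cv.fit_transform(docs)

# Columns are ordered by the indices stored in cv.vocabulary_.
columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(columns)
print(counts.toarray())
```

With these two documents the columns come out in sorted order ('1', '2', '2.75', '3.75', 'aaa', 'www', 'zzz'), and each row has a 1 in the column of every token the document contains. Note that CountVectorizer lowercases text by default; pass `lowercase=False` if case matters for your tokens.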