
token_pattern for numbers in sklearn's TfidfVectorizer in Python


I need to calculate the TF-IDF matrix for a few sentences. The sentences include both numbers and words. I am using the code below to do so:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

data1 = ['1/8 wire', '4 tube', '1-1/4 brush']
dataset = pd.DataFrame(data1, columns=['des'])
vectorizer1 = TfidfVectorizer(lowercase=False)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (1.0+) replaces it
tf_idf_matrix = pd.DataFrame(
    vectorizer1.fit_transform(dataset['des']).toarray(),
    columns=vectorizer1.get_feature_names_out()
)

TfidfVectorizer includes only the words in its vocabulary, i.e.

Out[3]: ['brush', 'tube', 'wire']

But I need the numbers to be tokens as well.

Expected:

Out[3]: ['brush', 'tube', 'wire','1/8','4','1-1/4']

After reading the TfidfVectorizer documentation, I learned that I have to change the token_pattern and tokenizer parameters, but I can't work out how to change them so that numbers and punctuation are included.

Can anyone please tell me how to set these parameters?


Solution

  • You're right, token_pattern needs a custom regex. Pass a pattern that treats any run of one or more non-whitespace characters as a single token.

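    tfidf = TfidfVectorizer(lowercase=False, token_pattern=r'\S+')
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (1.0+) replaces it
    tf_idf_matrix = pd.DataFrame(
        tfidf.fit_transform(dataset['des']).toarray(),
        columns=tfidf.get_feature_names_out()
    )

    print(tf_idf_matrix)
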
          1-1/4       1/8         4     brush      tube      wire
    0  0.000000  0.707107  0.000000  0.000000  0.000000  0.707107
    1  0.000000  0.000000  0.707107  0.000000  0.707107  0.000000
    2  0.707107  0.000000  0.000000  0.707107  0.000000  0.000000
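
To see why the default settings drop the numbers, it may help to run both patterns through the standard `re` module directly. This is only a sketch: it assumes the documented default `token_pattern` of `r"(?u)\b\w\w+\b"`, and the `tokenize` helper is a hypothetical stand-in for the vectorizer's internal tokenizer, not part of scikit-learn.

```python
import re

# Assumption: scikit-learn's documented default token_pattern.
# It only keeps runs of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# The pattern from the answer: any run of non-whitespace characters.
NON_WHITESPACE = r"\S+"

docs = ['1/8 wire', '4 tube', '1-1/4 brush']


def tokenize(pattern, text):
    """Hypothetical helper: split text the way a findall-based tokenizer would."""
    return re.findall(pattern, text)


default_tokens = [tokenize(DEFAULT_PATTERN, d) for d in docs]
custom_tokens = [tokenize(NON_WHITESPACE, d) for d in docs]

# Single digits such as '1', '8' and '4' are too short for the default
# pattern, and '/' and '-' act as separators, so only the words survive.
print(default_tokens)
# With \S+ each whitespace-separated chunk stays intact.
print(custom_tokens)
```

One thing to keep in mind is that `\S+` also keeps any punctuation attached to a word, so 'brush,' and 'brush' would become two different tokens. That is fine for data like the above, but free text may need a narrower pattern.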