I need to calculate the TF-IDF matrix for a few sentences. The sentences include both numbers and words. I am using the code below:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
data1=['1/8 wire','4 tube','1-1/4 brush']
dataset=pd.DataFrame(data1, columns=['des'])
vectorizer1 = TfidfVectorizer(lowercase=False)
tf_idf_matrix = pd.DataFrame(vectorizer1.fit_transform(dataset['des']).toarray(),columns=vectorizer1.get_feature_names())
TfidfVectorizer keeps only the words in its vocabulary, i.e.
Out[3]: ['brush', 'tube', 'wire']
but I need the numbers to be part of the tokens as well.
Expected:
Out[3]: ['brush', 'tube', 'wire','1/8','4','1-1/4']
After reading the TfidfVectorizer documentation, I learned that I have to change the token_pattern or tokenizer parameter, but I can't work out how to set them so that numbers and punctuation are kept.
Can anyone please tell me how to change these parameters?
You're right: token_pattern needs a custom regex. The default pattern only keeps tokens made of two or more word characters, so strings such as 4 and 1/8 are dropped. Pass a pattern that treats any run of one or more non-whitespace characters as a single token:
tfidf = TfidfVectorizer(lowercase=False, token_pattern=r'\S+')
tf_idf_matrix = pd.DataFrame(
tfidf.fit_transform(dataset['des']).toarray(),
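# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
columns=tfidf.get_feature_names_out()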
)
print(tf_idf_matrix)
      1-1/4       1/8         4     brush      tube      wire
0  0.000000  0.707107  0.000000  0.000000  0.000000  0.707107
1  0.000000  0.000000  0.707107  0.000000  0.707107  0.000000
2  0.707107  0.000000  0.000000  0.707107  0.000000  0.000000
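To see why the pattern makes the difference, here is a minimal sketch that applies both regexes to the sample strings using only the standard re module. It assumes the vectorizer extracts tokens roughly the way re.findall does with the given pattern, and that the default token_pattern is the one documented for TfidfVectorizer; the variable names are just for illustration.

```python
import re

docs = ['1/8 wire', '4 tube', '1-1/4 brush']

# Default token_pattern documented for TfidfVectorizer:
# a token is two or more consecutive word characters.
DEFAULT_PATTERN = r'(?u)\b\w\w+\b'

# Pattern suggested above: any run of non-whitespace characters.
CUSTOM_PATTERN = r'\S+'

# Approximate the tokenization step with re.findall (an assumption
# about how the pattern is applied, not the library's exact code path).
default_tokens = [re.findall(DEFAULT_PATTERN, d) for d in docs]
custom_tokens = [re.findall(CUSTOM_PATTERN, d) for d in docs]

print(default_tokens)  # [['wire'], ['tube'], ['brush']]
print(custom_tokens)   # [['1/8', 'wire'], ['4', 'tube'], ['1-1/4', 'brush']]
```

With the default pattern, single digits are too short to count as tokens and the / and - characters act as separators, so only the words survive. With \S+ each whitespace-delimited chunk is kept whole.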