I would like to understand a bit better how TfidfVectorizer works. I don't understand how one uses the subsequent functions like get_feature_names.
Here is a reproducible example for my question:
from sklearn.feature_extraction.text import TfidfVectorizer
text = ['It was a queer, sultry summer', 'the summer they electrocuted the Rosenbergs',
        'and I didn’t know what I was doing in New York', 'I m stupid about executions',
        'The idea of being electrocuted makes me sick',
        'and that’s all there was to read about in the papers',
        'goggle eyed headlines staring up at me on every street corner and at the fusty',
        'peanut-selling mouth of every subway', 'It had nothing to do with me',
        'but I couldn’t help wondering what it would be like',
        'being burned alive all along your nerves']
tfidf_vect = TfidfVectorizer(max_df=0.7,
                             min_df=0.01,
                             use_idf=True,
                             ngram_range=(1, 2))
tfidf_mat = tfidf_vect.fit_transform(text)
print(tfidf_mat)
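# Note: scikit-learn 1.0 deprecated get_feature_names() in favour of
# get_feature_names_out(), and 1.2 removed it; use the latter on recent versions.
features = tfidf_vect.get_feature_names()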
print(features)
In this example, I thought that my object tfidf_vect was defining all the parameters I want for my application of TfidfVectorizer, which I then apply to text, to obtain the results in the object tfidf_mat.
I don't understand then why, to extract additional information about my tfidf analysis, I apply functions to the object tfidf_vect and not to tfidf_mat.
How does the command tfidf_vect.get_feature_names() know that this is to be applied to text, if that wasn't specified in its definition?
The command tfidf_vect.get_feature_names() works because tfidf_vect is an instance of the class TfidfVectorizer. An instance of this class has certain attributes (see the documentation). These attributes can be set or changed by calling methods on the instance, such as fit_transform, which stores what it learned from text (for example, the vocabulary) on tfidf_vect itself. get_feature_names then has access to the same attributes of the class instance as fit_transform does. You might want to read more about classes, methods, attributes and such.
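To make the idea of shared instance state concrete, here is a minimal sketch using a made-up class. ToyVectorizer is hypothetical and only mimics the pattern; it is not how scikit-learn implements TfidfVectorizer, and it counts words rather than computing tf-idf.

```python
class ToyVectorizer:
    """Hypothetical stand-in for a vectorizer; not scikit-learn's implementation."""

    def __init__(self, lowercase=True):
        # Only the settings are stored here; no data has been seen yet.
        self.lowercase = lowercase

    def fit_transform(self, docs):
        # Learn the vocabulary from the data and store it on the instance.
        tokens = [(d.lower() if self.lowercase else d).split() for d in docs]
        words = sorted({w for t in tokens for w in t})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        # Return a plain count matrix: one row per document, one column per word.
        return [[t.count(w) for w in self.vocabulary_] for t in tokens]

    def get_feature_names(self):
        # Reads the state that fit_transform stored on self.
        return list(self.vocabulary_)


toy = ToyVectorizer()
matrix = toy.fit_transform(["the summer", "The idea"])
names = toy.get_feature_names()
print(matrix)
print(names)
```

The returned matrix is just numbers; the mapping from column index to word lives on toy, which is why the names are requested from the vectorizer and not from the matrix.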
So: tfidf_mat simply holds the return value of fit_transform() (which is a sparse matrix of shape (n_samples, n_features)). After you call fit_transform(), tfidf_vect's attributes are changed, and those attributes can be accessed by any method of that class instance (so also by get_feature_names()).
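The same thing can be observed on a real TfidfVectorizer. The sketch below assumes scikit-learn is installed and uses two shortened documents of my own choosing; it inspects the fitted attribute vocabulary_, which maps each term to its column index, so that it does not depend on which feature-name method your scikit-learn version provides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the summer they electrocuted", "it had nothing to do with me"]

vect = TfidfVectorizer()
# Before fitting, the vectorizer holds only its parameters.
fitted_before = hasattr(vect, "vocabulary_")

mat = vect.fit_transform(docs)
# After fitting, the learned vocabulary is stored on the vectorizer.
fitted_after = hasattr(vect, "vocabulary_")

# Order the terms by their column index to recover the feature names.
names = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

print(fitted_before, fitted_after)
print(mat.shape)
print(names)
```

Here mat only knows its shape and values, while vect knows which term each column stands for.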