python, python-3.x, scikit-learn, tfidfvectorizer

.vocabulary_ vs .get_feature_names()


Both of these belong to sklearn's TfidfVectorizer.

Could someone please explain the similarities and differences between the two, and when each one is useful?

It is quite confusing because they look very similar to each other but also quite different.

The rather limited sklearn documentation does not help much here either.


Solution

  • Basically, I think they contain exactly the same information, just indexed in opposite directions.

    If you have a term and want its column position in the tf-idf matrix, use .vocabulary_.

    .vocabulary_ is a dict whose keys are the terms and whose values are their column positions in the tf-idf matrix.

    Conversely, if you know a term's column position in the tf-idf matrix and want its name, use .get_feature_names().

    The position of each term in the list returned by .get_feature_names() corresponds to that term's column position in the tf-idf matrix.
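
A minimal sketch of the two lookups described above. The two toy documents and the variable names are made up for illustration. One assumption worth stating: in scikit-learn 1.0 and later the method is called .get_feature_names_out(), and .get_feature_names() was removed in 1.2, so the sketch uses whichever of the two is available.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for this example.
docs = ["the cat sat", "the dog barked"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse tf-idf matrix, one column per term

# Column index -> term. Newer scikit-learn versions only provide
# get_feature_names_out(); older ones only provide get_feature_names().
if hasattr(vectorizer, "get_feature_names_out"):
    names = list(vectorizer.get_feature_names_out())
else:
    names = list(vectorizer.get_feature_names())

# Term -> column index.
vocab = vectorizer.vocabulary_

# Forward lookup: which column holds the term "cat"?
col = vocab["cat"]

# Reverse lookup: which term lives in that column?
term = names[col]

# Both views describe the same mapping, so one can be rebuilt from the other.
names_from_vocab = sorted(vocab, key=vocab.get)

print(col, term)
print(names_from_vocab == names)
```

If you only ever go from term to column, .vocabulary_ gives a constant-time dict lookup; searching the list of names for a term would work too, but scans the list each time.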