What is the difference in calculating TF-IDF through Texthero:
import pandas as pd
import texthero as hero
s = pd.Series(["Sentence one", "Sentence two"])
hero.tfidf(s, return_feature_names=True)
0 [0.5797386715376657, 0.8148024746671689, 0.0]
1 [0.5797386715376657, 0.0, 0.8148024746671689]
['Sentence', 'one', 'two'])
and the TF-IDF from sklearn? I would expect the results from sklearn given these example sentences.
from sklearn.feature_extraction.text import TfidfVectorizer
...
Sentence one two
0 0.0 0.346574 0.000000
1 0.0 0.000000 0.346574
Short answer
tfidf does not preprocess the input text and just applies the TF-IDF algorithm, whereas TfidfVectorizer preprocesses the input by default.
Function signatures
The difference lies in the way you are supposed to work with the two frameworks.
Look at the function signatures:
scikit-learn TfidfVectorizer:
sklearn.feature_extraction.text.TfidfVectorizer(
*,
input='content',
encoding='utf-8',
decode_error='strict',
strip_accents=None,
lowercase=True,
preprocessor=None,
tokenizer=None,
analyzer='word',
stop_words=None,
token_pattern=r'(?u)\b\w\w+\b',
ngram_range=(1, 1),
max_df=1.0,
min_df=1,
max_features=None,
vocabulary=None,
binary=False,
dtype=<class 'numpy.float64'>,
norm='l2',
use_idf=True,
smooth_idf=True,
sublinear_tf=False
)
Texthero tfidf:
tfidf(
s: pandas.core.series.Series,
max_features=None,
min_df=1,
return_feature_names=False
)
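In the case of scikit-learn, the text preprocessing steps (lowercasing, accent stripping, tokenization, stop-word removal) are built into TfidfVectorizer and controlled through its parameters. In the case of Texthero's tfidf, there is no text preprocessing: its only parameters are max_features, min_df and return_feature_names.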
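To make the preprocessing difference concrete, here is a minimal sketch that uses only the standard library. It mimics just two of the TfidfVectorizer defaults from the signature above (lowercase=True and the default token_pattern) and is not the library's actual implementation. The whitespace split used to represent "no preprocessing" is an assumption for illustration, and both function names are hypothetical.

```python
import re

# Default taken from the TfidfVectorizer signature above:
# a token is a run of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def sklearn_like_tokens(text, lowercase=True):
    """Rough stand-in for the default preprocessing: lowercase, then regex tokenization."""
    if lowercase:
        text = text.lower()
    return re.findall(TOKEN_PATTERN, text)


def no_preprocessing_tokens(text):
    """Assumption: a plain whitespace split stands in for 'no preprocessing'."""
    return text.split()


print(sklearn_like_tokens("Sentence one"))      # ['sentence', 'one']
print(no_preprocessing_tokens("Sentence one"))  # ['Sentence', 'one']
print(sklearn_like_tokens("I am a cat"))        # ['am', 'cat']
```

Note that the default token pattern also drops single-character tokens such as "I" and "a", which is another way the two vocabularies can end up different.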
Your example
In your example, the results differ between the two cases because, for instance, TfidfVectorizer converts all characters to lowercase by default, whereas the Texthero output keeps the capitalized 'Sentence' as a feature name.
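If you want to check where the numbers in the two tables of the question can come from, the following pure-Python sketch reproduces both. The first function uses the unsmoothed textbook formula (tf as a relative frequency, idf = ln(n / df), no normalization). The second uses the smoothed idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is what scikit-learn documents for smooth_idf=True and norm='l2'. That the textbook variant is what produced the second table is an assumption, since the code behind it is elided in the question. The function names are hypothetical.

```python
import math

# Pre-tokenized documents, with no lowercasing applied.
docs = [["Sentence", "one"], ["Sentence", "two"]]
vocab = sorted({term for doc in docs for term in doc})  # ['Sentence', 'one', 'two']


def doc_freq(term):
    """Number of documents that contain the term."""
    return sum(term in doc for doc in docs)


def textbook_tfidf(doc):
    """tf = count / len(doc), idf = ln(n / df); no smoothing, no normalization."""
    n = len(docs)
    return [doc.count(t) / len(doc) * math.log(n / doc_freq(t)) for t in vocab]


def smoothed_l2_tfidf(doc):
    """tf = raw count, idf = ln((1 + n) / (1 + df)) + 1, then L2-normalize the row."""
    n = len(docs)
    raw = [doc.count(t) * (math.log((1 + n) / (1 + doc_freq(t))) + 1) for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


for doc in docs:
    print([round(x, 6) for x in textbook_tfidf(doc)])
    print([round(x, 6) for x in smoothed_l2_tfidf(doc)])
```

For the first sentence, the textbook variant gives roughly 0.0, 0.346574, 0.0 and the smoothed, normalized variant gives roughly 0.579739, 0.814802, 0.0. So the formula and normalization settings matter at least as much as the preprocessing when you compare outputs.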
Which one is better?
Depending on your task, one of the two solutions might be more convenient.
If you are working with a pandas DataFrame/Series on a natural language preprocessing task and you want fine-grained control over your code, then it's probably more convenient to use tfidf.
If, on the other hand, you are working on a more generic ML task where you also need to deal with some text and just want to quickly represent it, then you might opt for TfidfVectorizer using the default settings.