
Texthero TF-IDF Calculation


What is the difference between calculating TF-IDF through Texthero:

import pandas as pd
import texthero as hero
s = pd.Series(["Sentence one", "Sentence two"])
hero.tfidf(s, return_feature_names=True)
0    [0.5797386715376657, 0.8148024746671689, 0.0]
1    [0.5797386715376657, 0.0, 0.8148024746671689]
['Sentence', 'one', 'two']

and the TF-IDF from sklearn? Given these example sentences, I would expect the results that sklearn produces.

from sklearn.feature_extraction.text import TfidfVectorizer
...
   Sentence       one       two
0       0.0  0.346574  0.000000
1       0.0  0.000000  0.346574

Solution

  • Short answer

    Texthero's tfidf does not preprocess the input text; it just applies the TF-IDF algorithm. TfidfVectorizer, by default, preprocesses the input before computing the weights.

    Function signatures

    The difference lies in the way you are supposed to work with the two frameworks.

    Look at the function signatures:

    scikit-learn TfidfVectorizer:

    sklearn.feature_extraction.text.TfidfVectorizer(
        *, 
        input='content', 
        encoding='utf-8', 
        decode_error='strict', 
        strip_accents=None, 
        lowercase=True, 
        preprocessor=None, 
        tokenizer=None, 
        analyzer='word', 
        stop_words=None, 
        token_pattern='(?u)\b\w\w+\b', 
        ngram_range=(1, 1), 
        max_df=1.0, 
        min_df=1, 
        max_features=None, 
        vocabulary=None, 
        binary=False, 
        dtype=<class 'numpy.float64'>, 
        norm='l2', 
        use_idf=True, 
        smooth_idf=True, 
        sublinear_tf=False
    )
    

    Texthero tfidf:

    tfidf(
        s: pandas.core.series.Series, 
        max_features=None, 
        min_df=1, 
        return_feature_names=False
    )
    

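    In scikit-learn, the text preprocessing steps (lowercasing, accent stripping, tokenization, stop-word removal) are built into TfidfVectorizer and controlled through its parameters. Texthero's tfidf performs no text preprocessing; it expects the Series to be cleaned beforehand.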

    Your example

    In your example, the TF-IDF values differ between the two cases because, for instance, TfidfVectorizer converts all characters to lowercase by default (lowercase=True).
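The effect of lowercasing can be sketched without either library. The snippet below is a minimal illustration, not the code of Texthero or scikit-learn: `tfidf_rows` is a hypothetical helper that assumes the defaults listed in the TfidfVectorizer signature above (the `token_pattern` regex, `smooth_idf=True`, `norm='l2'`) and toggles only the lowercasing step.

```python
import math
import re
from collections import Counter

# Same regex as the token_pattern default in the signature above:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tfidf_rows(docs, lowercase):
    """Hypothetical helper: smoothed, L2-normalised TF-IDF in plain Python."""
    if lowercase:
        docs = [doc.lower() for doc in docs]
    counts = [Counter(TOKEN_PATTERN.findall(doc)) for doc in docs]
    # Sorted by code point, so uppercase terms come before lowercase ones.
    vocabulary = sorted(set().union(*counts))
    n_docs = len(docs)
    rows = []
    for count in counts:
        row = []
        for term in vocabulary:
            df = sum(1 for c in counts if term in c)  # document frequency
            idf = math.log((1 + n_docs) / (1 + df)) + 1  # assumed smooth_idf=True
            row.append(count[term] * idf)
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # assumed norm='l2'
        rows.append([v / norm for v in row])
    return vocabulary, rows


docs = ["Sentence one", "Sentence two"]
for lowercase in (False, True):
    vocabulary, rows = tfidf_rows(docs, lowercase)
    print(lowercase, vocabulary, [round(v, 4) for v in rows[0]])
```

With case preserved, the vocabulary is `['Sentence', 'one', 'two']` and the first row is roughly `[0.5797, 0.8148, 0.0]`, the same numbers Texthero printed in the question. With lowercasing, the vocabulary becomes `['one', 'sentence', 'two']`, so the same weights appear under different feature names and in a different column order.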

    Which one is better?

    Depending on your task, one of the two solutions might be more convenient.

    If you are working with a pandas DataFrame/Series on a natural language preprocessing task and you want fine-grained control over each step, then tfidf is probably the more convenient choice.

    If, on the other hand, you are working on a more generic ML task where you also need to deal with some text and just want to quickly represent it, then you might opt for TfidfVectorizer using the default settings.