python · scikit-learn · nlp · tf-idf

Scikit-learn TF-IDF - Unsure of Interpretation of TF-IDF Array?


I have a subset of a dataframe like:

<OUT>
PageNumber    Top_words_only
56            people sun flower festival 
75            sunflower sun architecture red buses festival

I want to calculate TF-IDF on the Top_words_only df column, with each row acting as a document. I have tried:

Vectorizer = TfidfVectorizer(lowercase = True, max_df = 0.8, min_df = 5, stop_words = 'english')
Vectors = Vectorizer.fit_transform(df['top_words_only'])

If I print the array it comes out as:

array([[0.        , 0.        , 0.        , ..., 0.        , 0.35588179,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

But I am a little confused by what this means - why are there so many 0 values? Does TfidfVectorizer() automatically calculate the TF-IDF value for each tag taking into account all documents (i.e. the corpus)?


Solution

  • Calling fit_transform calculates a vector for each supplied document. Each vector is the same size: the number of unique words across all supplied documents. The number of zero values in a vector is the vector size minus the number of unique words in that document.

    Using your top_words as a simple example. You show 2 documents:

    'people sun flower festival'
    'sunflower sun architecture red buses festival'
    

    These have a total of 8 unique words (Vectorizer.get_feature_names_out() will give you these):

    'architecture', 'buses', 'festival', 'flower', 'people', 'red', 'sun', 'sunflower'
    

    Calling fit_transform with those 2 documents will give 2 vectors (1 for each doc), each with length 8 (number of unique words across the documents).

    The first document, 'people sun flower festival' has 4 words, so, you get 4 values in the vector, and 4 zeros. Similarly 'sunflower sun architecture red buses festival' gives 6 values and 2 zeros.

    The more documents you pass in with different words, the longer each vector gets, and the more zeros it is likely to contain.

    from sklearn.feature_extraction.text import TfidfVectorizer
    
    top_words = ['people sun flower festival', 'sunflower sun architecture red buses festival']
    
    Vectorizer = TfidfVectorizer()
    Vectors = Vectorizer.fit_transform(top_words)
    
    print(f'Feature names: {Vectorizer.get_feature_names_out().tolist()}')
    tfidf = Vectors.toarray()
    print()
    print(f'top_words[0] = {top_words[0]}')
    print(f'tfidf[0] = {tfidf[0].tolist()}')
    print()
    print(f'top_words[1] = {top_words[1]}')
    print(f'tfidf[1] = {tfidf[1].tolist()}')
    

    The above code will print:

    Feature names: ['architecture', 'buses', 'festival', 'flower', 'people', 'red', 'sun', 'sunflower']
    
    top_words[0] = people sun flower festival
    tfidf[0] = [0.0, 0.0, 0.40993714596036396, 0.5761523551647353, 0.5761523551647353, 0.0, 0.40993714596036396, 0.0]
    
    top_words[1] = sunflower sun architecture red buses festival
    tfidf[1] = [0.4466561618018052, 0.4466561618018052, 0.31779953783628945, 0.0, 0.0, 0.4466561618018052, 0.31779953783628945, 0.4466561618018052]
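    If you want the output to be easier to read, one option (assuming pandas is available) is to wrap the array in a DataFrame, labelling each column with its feature name. The zeros then line up visibly with the words each document does not contain:

    ```python
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    top_words = ['people sun flower festival',
                 'sunflower sun architecture red buses festival']

    Vectorizer = TfidfVectorizer()
    Vectors = Vectorizer.fit_transform(top_words)

    # One row per document, one column per unique word across the corpus
    df_tfidf = pd.DataFrame(Vectors.toarray(),
                            columns=Vectorizer.get_feature_names_out())
    print(df_tfidf.round(3))
    ```

    Here a cell is 0 exactly when that word does not appear in that document, e.g. 'architecture' in document 0.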