
Calculate TF-IDF for a single word in Textacy


I'm trying to use Textacy to calculate the TF-IDF score for a single word across the standard corpus, but am a bit unclear about the result I am receiving.

I was expecting a single float which represented the frequency of the word in the corpus. So why am I receiving a list (?) of 7 results?

"acculer" is actually a French word, so was expecting a result of 0 from an English corpus.

word = 'acculer'
vectorizer = textacy.Vectorizer(tf_type='linear', apply_idf=True, idf_type='smooth')
tf_idf = vectorizer.fit_transform(word)
logger.info("tf_idf:")
logger.info(tf_idf)

Output

tf_idf:
(0, 0)  2.386294361119891
(1, 1)  1.9808292530117262
(2, 1)  1.9808292530117262
(3, 5)  2.386294361119891
(4, 3)  2.386294361119891
(5, 2)  2.386294361119891
(6, 4)  2.386294361119891

The second part of the question is how can I provide my own corpus to the TF-IDF function in Textacy, esp. one in a different language?

EDIT

As suggested by @Vishal, I have logged the output using this line:

logger.info(vectorizer.vocabulary_terms)

It seems the provided word acculer has been split into characters.

{'a': 0, 'c': 1, 'u': 5, 'l': 3, 'e': 2, 'r': 4}

(1) How can I get the TF-IDF for this word against the corpus, rather than each character?

(2) How can I provide my own corpus and point to it as a param?

(3) Can TF-IDF be used at a sentence level? ie: what is the relative frequency of this sentence's terms against the corpus.


Solution

  • Fundamentals

    Let's get the definitions clear before looking into the actual questions.

    Assume our corpus contains 3 documents (d1, d2 and d3 respectively):

    corpus = ["this is a red apple", "this is a green apple", "this is a cat"]
    

    Term Frequency (tf)

    tf (of a word) is defined as the number of times the word appears in a document.

    tf(word, document) = count(word, document) # Number of times word appears in the document
    

    tf is defined for a word at document level.

    tf('a',d1)     = 1      tf('a',d2)     = 1      tf('a',d3)     = 1
    tf('apple',d1) = 1      tf('apple',d2) = 1      tf('apple',d3) = 0
    tf('cat',d1)   = 0      tf('cat',d2)   = 0      tf('cat',d3)   = 1
    tf('green',d1) = 0      tf('green',d2) = 1      tf('green',d3) = 0
    tf('is',d1)    = 1      tf('is',d2)    = 1      tf('is',d3)    = 1
    tf('red',d1)   = 1      tf('red',d2)   = 0      tf('red',d3)   = 0
    tf('this',d1)  = 1      tf('this',d2)  = 1      tf('this',d3)  = 1
    
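    These counts are easy to sanity-check without textacy; for illustration, a minimal sketch using only the standard library that reproduces the table above:

    from collections import Counter

    corpus = ["this is a red apple", "this is a green apple", "this is a cat"]
    for i, doc in enumerate(corpus, start=1):
        counts = Counter(doc.split())            # raw term counts for this document
        print("d{}:".format(i), dict(counts))
    # d1: {'this': 1, 'is': 1, 'a': 1, 'red': 1, 'apple': 1}
    # d2: {'this': 1, 'is': 1, 'a': 1, 'green': 1, 'apple': 1}
    # d3: {'this': 1, 'is': 1, 'a': 1, 'cat': 1}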

    Using the raw counts has a problem: the tf values of words in longer documents tend to be higher than those in shorter documents. This can be solved by normalizing the raw counts, i.e. dividing by the document length (the number of words in the corresponding document); this is called l1 normalization. The document d1 can now be represented by a tf vector containing the tf values of all the words in the vocabulary of the corpus. There is another kind of normalization, called l2, which makes the l2 norm of the document's tf vector equal to 1.

    tf(word, document, normalize='l1') = count(word, document)/|document|
    tf(word, document, normalize='l2') = count(word, document)/l2_norm(document)
    
    |d1| = 5, |d2| = 5, |d3| = 4
    l2_norm(d1) = 2.236, l2_norm(d2) = 2.236, l2_norm(d3) = 2
    
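    To see where 0.2, 0.447 and 0.5 come from, here is a small sketch (plain numpy, not textacy) that normalizes the raw count vector of d1 by hand:

    import numpy as np

    # raw counts for d1 = "this is a red apple", in vocabulary order
    # {'a': 0, 'apple': 1, 'cat': 2, 'green': 3, 'is': 4, 'red': 5, 'this': 6}
    tf_d1 = np.array([1, 1, 0, 0, 1, 1, 1], dtype=float)

    print(tf_d1 / tf_d1.sum())             # l1: divide by document length 5  -> 0.2
    print(tf_d1 / np.linalg.norm(tf_d1))   # l2: divide by sqrt(5) ~ 2.236    -> 0.447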

    Code : tf

    corpus = ["this is a red apple", "this is a green apple", "this is a cat"]
    # Convert docs to textacy format
    textacy_docs = [textacy.Doc(doc) for doc in corpus]
    
    for norm in [None, 'l1', 'l2']:
        # tokenize the documents
        tokenized_docs = [
            doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
            for doc in textacy_docs]
    
        # Fit the tf matrix 
        vectorizer = textacy.Vectorizer(apply_idf=False, norm=norm)
        doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
    
        print ("\nVocabulary: ", vectorizer.vocabulary_terms)
        print ("TF with {0} normalize".format(norm))
        print (doc_term_matrix.toarray())
    

    Output:

    Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
    TF with None normalize
    [[1 1 0 0 1 1 1]
     [1 1 0 1 1 0 1]
     [1 0 1 0 1 0 1]]
    
    Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
    TF with l1 normalize
    [[0.2  0.2  0.   0.   0.2  0.2  0.2 ]
     [0.2  0.2  0.   0.2  0.2  0.   0.2 ]
     [0.25 0.   0.25 0.   0.25 0.   0.25]]
    
    Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
    TF with l2 normalize
    [[0.4472136 0.4472136 0.        0.        0.4472136 0.4472136 0.4472136]
     [0.4472136 0.4472136 0.        0.4472136 0.4472136 0.        0.4472136]
     [0.5       0.        0.5       0.        0.5       0.        0.5      ]]
    

    The rows of the tf matrix correspond to documents (hence 3 rows for our corpus) and the columns correspond to the words in the vocabulary (each word's column index is shown in the vocabulary dictionary).

    Inverse Document Frequency (idf)

    Some words convey less information than others. For example, words like the, a, an, this, that are very common and convey very little information. idf is a measure of the importance of a word: a word appearing in many documents is considered less informative than a word appearing in only a few documents.

    idf(word, corpus) = log(|corpus| / No:of documents containing word) + 1  # standard idf
    

    For our corpus intuitively idf(apple, corpus) < idf(cat,corpus)

    idf('apple', corpus) = log(3/2) + 1 = 1.405 
    idf('cat', corpus) = log(3/1) + 1 = 2.098
    idf('this', corpus) = log(3/3) + 1 = 1.0
    
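    These values are easy to reproduce by hand; a minimal sketch (standard library only, using the natural log as in the formula above):

    import math

    n_docs = 3
    doc_freq = {'apple': 2, 'cat': 1, 'this': 3}   # number of documents containing each word

    for word, df in doc_freq.items():
        idf = math.log(n_docs / df) + 1            # standard idf
        print(word, round(idf, 3))
    # apple 1.405
    # cat 2.099
    # this 1.0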

    Code : idf

    textacy_docs = [textacy.Doc(doc) for doc in corpus]    
    tokenized_docs = [
        doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
        for doc in textacy_docs]
    
    vectorizer = textacy.Vectorizer(apply_idf=False, norm=None)
    doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
    
    print ("\nVocabulary: ", vectorizer.vocabulary_terms)
    print ("standard idf: ")
    print (textacy.vsm.matrix_utils.get_inverse_doc_freqs(doc_term_matrix, type_='standard'))
    

    Output:

    Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
    standard idf: 
    [1.     1.405       2.098       2.098       1.      2.098       1.]
    

    Term Frequency–Inverse Document Frequency (tf-idf): tf-idf is a measure of how important a word is to a document in a corpus. The tf of a word weighted by its idf gives us the tf-idf measure of the word.

    tf-idf(word, document, corpus) = tf(word, document) * idf(word, corpus)
    
    tf-idf('apple', 'd1', corpus) = tf('apple', 'd1') * idf('apple', corpus) = 1 * 1.405 = 1.405
    tf-idf('cat', 'd3', corpus) = tf('cat', 'd3') * idf('cat', corpus) = 1 * 2.098 = 2.098
    

    Code : tf-idf

    textacy_docs = [textacy.Doc(doc) for doc in corpus]
    
    tokenized_docs = [
        doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
        for doc in textacy_docs]
    
    print ("\nVocabulary: ", vectorizer.vocabulary_terms)
    print ("tf-idf: ")
    
    vectorizer = textacy.Vectorizer(apply_idf=True, norm=None, idf_type='standard')
    doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
    print (doc_term_matrix.toarray())
    

    Output:

    Vocabulary:  {'this': 6, 'is': 4, 'a': 0, 'red': 5, 'apple': 1, 'green': 3, 'cat': 2}
    tf-idf: 
    [[1.         1.405   0.         0.         1.         2.098   1.        ]
     [1.         1.405   0.         2.098      1.         0.      1.        ]
     [1.         0.      2.098      0.         1.         0.      1.        ]]
    

    Now coming to the questions:

    (1) How can I get the TF-IDF for this word against the corpus, rather than each character?

    As seen above, tf-idf is not defined for a word in isolation; the tf-idf of a word is always with respect to a document in a corpus.
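
    If what you want is the tf-idf score of one particular word in every document, you can look up its column in the fitted matrix. A quick sketch reusing the vectorizer and doc_term_matrix fitted in the tf-idf code above:

    word = 'apple'

    if word in vectorizer.vocabulary_terms:
        col = vectorizer.vocabulary_terms[word]        # column index of the word
        print(doc_term_matrix.toarray()[:, col])       # tf-idf of 'apple' in d1, d2, d3 -> [1.405 1.405 0.]
    else:
        # e.g. 'acculer' does not occur in the vocabulary of an English corpus
        print("'{}' does not occur in the corpus".format(word))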

    (2) How can I provide my own corpus and point to it as a param?

    It is shown in the samples above; the three steps are summarized below, with a consolidated sketch after the list.

    1. Convert the text documents into textacy Docs using the textacy.Doc API
    2. Tokenize the textacy.Docs using the to_terms_list method. (With this method you can add unigrams, bigrams or trigrams to the vocabulary, filter out stop words, normalize text, etc.)
    3. Use textacy.Vectorizer to create the term matrix from the tokenized documents. The term matrix returned is:
      • tf (raw counts): apply_idf=False, norm=None
      • tf (l1 normalized): apply_idf=False, norm='l1'
      • tf (l2 normalized): apply_idf=False, norm='l2'
      • tf-idf (standard): apply_idf=True, idf_type='standard'
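
    Putting the three steps together for your own corpus (here a made-up French one), a sketch of the same pipeline. This assumes the same textacy API used above and that the matching spaCy language model is installed; the lang value is an assumption:

    import textacy

    my_corpus = ["le chat dort sur la table", "le chien dort sous la table"]

    # 1. Convert the raw texts into textacy Docs (lang selects the spaCy pipeline)
    my_docs = [textacy.Doc(text, lang='fr') for text in my_corpus]

    # 2. Tokenize with to_terms_list
    tokenized_docs = [
        doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
        for doc in my_docs]

    # 3. Fit the Vectorizer to get the tf-idf term matrix
    vectorizer = textacy.Vectorizer(apply_idf=True, idf_type='standard', norm=None)
    doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
    print(vectorizer.vocabulary_terms)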

    (3) Can TF-IDF be used at a sentence level? ie: what is the relative frequency of this sentence's terms against the corpus.

    Yes, you can, provided you treat each sentence as a separate document. In such a case the tf-idf vector (full row) of the corresponding document can be treated as a vector representation of the document (which is a single sentence in your case).
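
    For illustration, a quick sketch (using a naive split on full stops, just to keep it short) that vectorizes each sentence of a text as its own document:

    text = "This is a red apple. This is a green apple. This is a cat."
    sentences = [s.strip() for s in text.split('.') if s.strip()]

    textacy_docs = [textacy.Doc(s) for s in sentences]
    tokenized_docs = [
        doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True, filter_stops=False, normalize='lower')
        for doc in textacy_docs]

    vectorizer = textacy.Vectorizer(apply_idf=True, idf_type='standard', norm=None)
    doc_term_matrix = vectorizer.fit_transform(tokenized_docs)   # one row per sentence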

    In the case of our corpus (which in fact contains a single sentence per document), the vector representations of d1 and d2 should be closer to each other than those of d1 and d3. Let's check the cosine similarity and see:

    from sklearn.metrics.pairwise import cosine_similarity   # assuming scikit-learn's implementation
    cosine_similarity(doc_term_matrix)
    

    Output

    array([[1.        ,     0.53044716,     0.35999211],
           [0.53044716,     1.        ,     0.35999211],
           [0.35999211,     0.35999211,     1.        ]])
    

    As you can see, cosine_similarity(d1, d2) = 0.53 and cosine_similarity(d1, d3) = 0.35, so indeed d1 and d2 are more similar than d1 and d3 (1 meaning exactly similar and 0 meaning not similar at all, i.e. orthogonal vectors).

    Once you have trained your Vectorizer, you can pickle the trained object to disk for later use.
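
    A minimal sketch of that, using the standard library pickle module:

    import pickle

    with open('vectorizer.pkl', 'wb') as f:
        pickle.dump(vectorizer, f)     # save the fitted Vectorizer

    with open('vectorizer.pkl', 'rb') as f:
        vectorizer = pickle.load(f)    # reload it later and reuse it on new tokenized documents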

    Conclusion

    The tf of a word is defined at the document level, the idf of a word at the corpus level, and the tf-idf of a word at the document level with respect to the corpus. They are well suited for vector representations of a document (or of a sentence, when a document is made up of a single sentence). If you are interested in vector representations of words, then explore word embeddings (word2vec, fastText, GloVe, etc.).