Tags: python, text, scikit-learn, tf-idf, n-gram

Calculate TF-IDF using sklearn for variable n-grams in Python


Problem: using scikit-learn to count the hits of variable n-grams from a particular vocabulary.

Explanation: I got the examples from here.

Imagine I have a corpus and I want to count how many hits a vocabulary like the following one gets:

myvocabulary = [{'window': 4, 'words': ['tin', 'tan']},
                {'window': 3, 'words': ['electrical', 'car']},
                {'window': 3, 'words': ['elephant', 'banana']}]

What I call window here is the length of the span of words within which the two words can appear, as follows:

'tin tan' is a hit (within 4 words)

'tin dog tan' is a hit (within 4 words)

'tin dog cat tan' is a hit (within 4 words)

'tin car sun eclipse tan' is NOT a hit: 'tin' and 'tan' appear more than 4 words apart.

I just want to count how many times {'window': 4, 'words': ['tin', 'tan']} appears in a text, do the same for all the other entries, and then add the results to a pandas DataFrame in order to compute TF-IDF. The closest I could find is something like this:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(vocabulary=myvocabulary, stop_words='english')
tfs = tfidf.fit_transform(corpus.values())

where vocabulary is a simple list of strings (either single words or multi-word phrases).

Besides, this from the scikit-learn documentation:

class sklearn.feature_extraction.text.CountVectorizer
ngram_range : tuple (min_n, max_n)

The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

does not help either.
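For example, a quick check on a toy corpus shows that ngram_range only ever extracts contiguous n-grams, so it can never match 'tin' and 'tan' with unrelated words in between:

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) extracts contiguous bigrams only.
cv = CountVectorizer(ngram_range=(2, 2))
cv.fit(['tin tan', 'tin dog tan'])
print(cv.get_feature_names_out())  # get_feature_names() on older scikit-learn
# ['dog tan' 'tin dog' 'tin tan']
print(cv.transform(['tin dog tan']).toarray())
# [[1 1 0]] -- the gapped 'tin ... tan' never produces the 'tin tan' bigram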

Any ideas?


Solution

  • I am not sure this can be done with CountVectorizer or TfidfVectorizer, so I have written my own function for it:

    import pandas as pd
    import numpy as np
    import string

    def contained_within_window(token, word1, word2, threshold):
        """Count the pairs of word1/word2 occurrences that sit at most
        `threshold` positions apart in the given text."""
        word1 = word1.lower()
        word2 = word2.lower()
        # Strip punctuation and lowercase before tokenizing.
        token = token.translate(str.maketrans('', '', string.punctuation)).lower()
        if word1 in token and word2 in token:  # cheap substring pre-check
            word_list = token.split(" ")
            # Positions of every exact occurrence of each word.
            word1_index = [i for i, x in enumerate(word_list) if x == word1]
            word2_index = [i for i, x in enumerate(word_list) if x == word2]
            count = 0
            for i in word1_index:
                for j in word2_index:
                    if np.abs(i - j) <= threshold:
                        count += 1
            return count
        return 0
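
    A quick sanity check against the question's examples, assuming threshold is the maximum allowed difference between word positions, so the question's window=4 corresponds to threshold=3:

    print(contained_within_window('tin dog cat tan', 'tin', 'tan', threshold=3))
    # 1 -- a hit: 'tin' and 'tan' are 3 positions apart
    print(contained_within_window('tin car sun eclipse tan', 'tin', 'tan', threshold=3))
    # 0 -- no hit: they are 4 positions apart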
    

    SAMPLE:

    corpus = [
        'This is the first document. And this is what I want',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
        'I like coding in sklearn',
        'This is a very good question'
    ]
    
    df = pd.DataFrame(corpus, columns=["Test"])
    

    Your df will look like this:

        Test
    0   This is the first document. And this is what I...
    1   This document is the second document.
    2   And this is the third one.
    3   Is this the first document?
    4   I like coding in sklearn
    5   This is a very good question
    

    Now you can apply contained_within_window as follows:

    sum(df.Test.apply(lambda x: contained_within_window(x,word1="this", word2="document",threshold=2)))
    

    And you get:

    2

    (Documents 0 and 1 are the two matches: once punctuation is stripped, each contains "this" within two word positions of "document".)
    

    You can just run a for loop over your different vocabulary entries, use the counts to construct your pandas DataFrame, and then apply TF-IDF to it, which is straightforward.
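
    For example, a minimal sketch of that last step, assuming the vocabulary keeps the {'window': ..., 'words': ...} shape from the question and using word pairs that actually occur in the sample corpus; TfidfTransformer turns the raw count matrix into TF-IDF weights:

    from sklearn.feature_extraction.text import TfidfTransformer

    # Illustrative vocabulary entries in the question's shape.
    myvocabulary = [{'window': 3, 'words': ['this', 'document']},
                    {'window': 4, 'words': ['this', 'first']}]

    columns = {}
    for entry in myvocabulary:
        w1, w2 = entry['words']
        # A window of n words corresponds to an index gap of at most n - 1.
        threshold = entry['window'] - 1
        columns[f"{w1} {w2} (window={entry['window']})"] = df.Test.apply(
            lambda text: contained_within_window(text, w1, w2, threshold))
    counts = pd.DataFrame(columns)

    # Turn the raw counts into TF-IDF weights.
    tfs = TfidfTransformer().fit_transform(counts.values)
    print(pd.DataFrame(tfs.toarray(), columns=counts.columns))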