python, scikit-learn, nlp, tf-idf

How to add threshold limit to TF-IDF values in a sparse matrix


I am using sklearn.feature_extraction.text.TfidfTransformer to get the TF-IDF values for my corpus.

This is what my code looks like:

    import json
    from collections import OrderedDict

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    X = dataset[:,0]
    Y = dataset[:,1]

    # Normalize each JSON document to a canonical string representation
    for index, item in enumerate(X):
        reqJson = json.loads(item, object_pairs_hook=OrderedDict)
        X[index] = json.dumps(reqJson, separators=(',', ':'))

    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(X)

    tfidf_transformer = TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

    # (58720, 167216) is the shape of my sparse matrix
    for i in range(0, 58720):
        for j in range(0, 167216):
            print(i, j)
            if X_train_tfidf[i, j] > 0.35:
                X_train_tfidf[i, j] = 0

As you can see, I want to zero out the TF-IDF values that are greater than 0.35 so that I can reduce my feature set and make my model more time-efficient, but using nested for loops only makes things worse. I have looked through the documentation of TfidfTransformer but cannot find a better way to do this. Any ideas or tips? Thank you.
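
For reference, the kind of one-shot operation I am hoping exists would look something like the sketch below, which works directly on the stored values of the SciPy sparse matrix (I am assuming X_train_tfidf is a CSR matrix, and 0.35 is just my example threshold):

    # Sketch only: threshold all nonzero entries of the sparse matrix at once,
    # instead of looping over every (row, column) pair.
    X_train_tfidf.data[X_train_tfidf.data > 0.35] = 0
    X_train_tfidf.eliminate_zeros()  # remove the entries that were just zeroed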


Solution

  • It sounds like this question is really about ignoring frequent words.

    The TfidfVectorizer (not TfidfTransformer) implementation includes a max_df parameter, documented as:

    When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).

    In the following example, word1 and word3 each occur in 3 of the 4 documents (>50%), so setting max_df=0.5 means the resulting vocabulary only includes word2:

    from sklearn.feature_extraction.text import TfidfVectorizer
    
    raw_data = [
        "word1 word2 word3",
        "word1 word1 word1",
        "word2 word2 word3",
        "word1 word1 word3",
    ]
    
    vect = TfidfVectorizer(max_df=0.5)
    X = vect.fit_transform(raw_data)
    
    print(vect.get_feature_names_out())
    print(X.todense())

    which prints:

    ['word2']
    [[1.]
     [0.]
     [1.]
     [0.]]
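
    If that matches what you need, the same idea should drop straight into the pipeline from the question by replacing CountVectorizer + TfidfTransformer with a single TfidfVectorizer; here is a rough sketch (the max_df=0.5 value is only illustrative and would need tuning for your corpus):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Sketch: TfidfVectorizer performs the counting and TF-IDF weighting in one step,
    # and max_df drops terms that appear in too large a fraction of the documents.
    vect = TfidfVectorizer(max_df=0.5)     # illustrative threshold, not tuned
    X_train_tfidf = vect.fit_transform(X)  # X: the JSON strings from the question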