Tags: python, scikit-learn, tf-idf, tfidfvectorizer, countvectorizer

Reduce dimension of word vectors from TfidfVectorizer / CountVectorizer


I want to use TfidfVectorizer (or CountVectorizer followed by TfidfTransformer) to get a vector representation of my terms. That means I want a vector for a term where the documents are the features. That is simply the transpose of the TF-IDF matrix created by TfidfVectorizer.

>>> vectorizer = TfidfVectorizer()
>>> model = vectorizer.fit_transform(corpus)
>>> model.transpose()
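
For reference, on a small made-up corpus the shapes line up as expected; the transpose flips the matrix from documents × terms to terms × documents:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = ["the cat sat", "the dog ran", "cats and dogs"]
>>> vectorizer = TfidfVectorizer()
>>> model = vectorizer.fit_transform(corpus)
>>> model.shape               # (n_documents, n_terms)
(3, 8)
>>> model.transpose().shape   # (n_terms, n_documents)
(8, 3)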

However, I have 800k documents, which means my term vectors are very sparse and very large (800k dimensions). The max_features parameter of CountVectorizer looks like exactly what I need: I can specify a dimension and CountVectorizer tries to fit all the information into that dimension. Unfortunately, this option applies to the document vectors rather than to the terms in the vocabulary. Hence, it reduces the size of my vocabulary, because the terms are the features.
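
A quick check on a toy corpus (made-up data) shows what max_features actually cuts: it caps the vocabulary axis, not the document axis:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = ["the cat sat on the mat", "the dog ran", "cats and dogs and mats"]
>>> CountVectorizer().fit_transform(corpus).shape
(3, 11)
>>> CountVectorizer(max_features=5).fit_transform(corpus).shape
(3, 5)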

Is there any way to do the opposite? Like performing a transpose on the TfidfVectorizer object before it starts cutting and normalizing everything? And if such an approach exists, how can I do that? Something like this:

>>> countVectorizer = CountVectorizer(input='filename', max_features=300, transpose=True)

I have been looking for such an approach for a while now, but every guide and code example I find talks about the document TF-IDF vectors rather than the term vectors. Thank you so much in advance!


Solution

  • I am not aware of any straightforward way to do this, but let me propose how it could be achieved.

    You are trying to represent each term in your corpus as a vector that uses the documents in your corpus as its component features. Because the number of documents (which are the features in your case) is very large, you would like to limit them in a way similar to what max_features does.

    According to the CountVectorizer documentation (the same applies to TfidfVectorizer):

    max_features int, default=None

    If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

    In a similar way, you want to keep the top documents ordered by their "frequency across the terms", as confusing as this may sound. This could be rephrased simplistically as "keep those documents that contain the most unique terms".

    One way I can think of doing that is by using inverse_transform and performing the following steps:

        from sklearn.feature_extraction.text import TfidfVectorizer

        vectorizer = TfidfVectorizer()
        model = vectorizer.fit_transform(corpus)
        
        # inverse_transform returns, for each document, the terms
        # that have nonzero entries in that document's row
        inverse_model = vectorizer.inverse_transform(model)
        
        # Each entry in the inverse model corresponds to a document
        # and contains an array of feature names (the terms).
        # As we want to rank the documents, we reduce each array of
        # feature names to the number of features that each
        # document is represented by.
        inverse_model_count = [len(doc_terms) for doc_terms in inverse_model]
        
        # As we are going to sort the list, we need to keep track of the
        # document id (its index in the corpus), so we pair each count
        # with its list index before we sort.
        inverse_model_count_tuples = list(enumerate(inverse_model_count))
        
        # Then we sort the list by the count of terms
        # in each document (the second component)
        max_features = 100
        top_documents_tuples = sorted(inverse_model_count_tuples,
                                      key=lambda item: item[1],
                                      reverse=True)[:max_features]
        
        # We are interested only in the document ids (the first tuple component)
        top_documents, _ = zip(*top_documents_tuples)
        
        # Having the top_documents ids we can slice the initial model to
        # keep only the rows of the selected documents. Note that zip()
        # yields a tuple, which scipy would interpret as a (row, column)
        # index pair, so we convert it to a list first.
        reduced_model = model[list(top_documents)]
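
    To get the term vectors the question asks for, the last step would be to transpose the reduced matrix, so that each row is a term represented over the max_features kept documents:

        # Rows are now terms, columns are the kept documents
        term_vectors = reduced_model.transpose()    # shape: (n_terms, max_features)
        
        # The term that corresponds to each row
        # (get_feature_names() on scikit-learn versions before 1.0)
        terms = vectorizer.get_feature_names_out()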
    

    Please note that this approach only takes into account the number of distinct terms per document, regardless of their count (CountVectorizer) or weight (TfidfVectorizer). If the direction of this approach is acceptable to you, then with some more code it is also possible to take the count or weight of the terms into account, as sketched below.
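
    For example, a rough sketch along the same lines that ranks the documents by the sum of their TF-IDF weights instead of by their number of nonzero terms:

        import numpy as np
        
        # Sum the TF-IDF weights of each document (row) instead of
        # counting its nonzero terms
        doc_weights = np.asarray(model.sum(axis=1)).ravel()
        
        # Keep the ids of the max_features highest-weighted documents
        top_documents = np.argsort(doc_weights)[::-1][:max_features]
        
        reduced_model = model[top_documents]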

    I hope this helps!