python scikit-learn search-engine tf-idf tfidfvectorizer

Using scikit-learn's TfidfVectorizer in a search engine


I'm looking at creating a search engine where I can fetch sentences (each of which represents a document) from a preprocessed PDF file using keywords.

I'm wondering if there is a built-in function in scikit-learn to show the data in a form similar to bag-of-words output, meaning I'd have all the words as columns (in pandas), all documents as rows, and the tf-idf scores as the cell values.


Solution

  • You can certainly do this for toy problems and for educational purposes, but it is impractical and strongly discouraged for real ones.

    The reason is that such term-document matrices are sparse (i.e. most of their entries are 0), and this sparsity is what allows them to be stored efficiently in appropriate data structures. Converting them to non-sparse structures (such as dense pandas DataFrames) would most probably overwhelm the memory of your machine; quoting from the relevant scikit-learn docs:

    As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).

    For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

    In order to be able to store such a matrix in memory but also to speed up algebraic operations matrix / vector, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
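    To make the quoted numbers concrete, here is a minimal sketch that measures the density and memory footprint of a tf-idf matrix. The corpus is synthetic and purely hypothetical (the made-up tokens `w0`, `w1`, ... and the sizes 5,000 / 1,000 / 20 are assumptions chosen for illustration), so the exact figures will differ for real text:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: 1,000 short "documents", each made of 20 tokens
# drawn at random from a made-up vocabulary of 5,000 tokens.
rng = np.random.default_rng(0)
vocab = [f"w{i}" for i in range(5000)]
docs = [" ".join(rng.choice(vocab, size=20)) for _ in range(1000)]

X = TfidfVectorizer().fit_transform(docs)  # SciPy sparse matrix in CSR format

n_rows, n_cols = X.shape
density = X.nnz / (n_rows * n_cols)  # fraction of entries that are stored

# Bytes held by the three CSR arrays vs. bytes a dense array would need
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = n_rows * n_cols * X.dtype.itemsize

print(f"shape={X.shape}, stored non-zeros={X.nnz}, density={density:.4%}")
print(f"sparse: {sparse_bytes / 1e6:.2f} MB, dense: {dense_bytes / 1e6:.2f} MB")
```

    The gap between the two sizes grows with both the number of documents and the vocabulary size, which is why a dense conversion that is harmless on a toy corpus tends to fail on a real one.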

    That said, you can do it for educational purposes; here is how, adapting the example in the TfidfVectorizer docs:

    from sklearn.feature_extraction.text import TfidfVectorizer
    import pandas as pd
    
    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
    ]
    
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    df = pd.DataFrame.sparse.from_spmatrix(X, columns=vectorizer.get_feature_names_out())
    df
    # result:
    
    
        and         document    first       is          one         second      the         third       this
    0   0.000000    0.469791    0.580286    0.384085    0.000000    0.000000    0.384085    0.000000    0.384085
    1   0.000000    0.687624    0.000000    0.281089    0.000000    0.538648    0.281089    0.000000    0.281089
    2   0.511849    0.000000    0.000000    0.267104    0.511849    0.000000    0.267104    0.511849    0.267104
    3   0.000000    0.469791    0.580286    0.384085    0.000000    0.000000    0.384085    0.000000    0.384085
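For the search-engine use case itself, the DataFrame is not needed: a query can be scored against the sparse matrix directly. The following is a minimal sketch, not a definitive implementation. The helper `search` and its `top_k` parameter are hypothetical names introduced here, and ranking by cosine similarity is an assumption about what "fetching by keywords" should mean. Because `TfidfVectorizer` L2-normalizes rows by default, the plain dot product computed by `linear_kernel` equals the cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # stays sparse throughout


def search(query, top_k=3):
    """Hypothetical helper: return (row index, score) pairs, best match first."""
    q = vectorizer.transform([query])      # reuse the fitted vocabulary and idf
    scores = linear_kernel(q, X).ravel()   # dot product of L2-normalized rows
    ranked = scores.argsort()[::-1][:top_k]
    # Drop documents that share no term with the query
    return [(int(i), float(scores[i])) for i in ranked if scores[i] > 0]


results = search("second document")
for idx, score in results:
    print(f"{score:.3f}  {corpus[idx]}")
```

Query words that are not in the fitted vocabulary are silently ignored by `transform`, so a query made up entirely of unseen words returns an empty list.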