Tags: python, tf-idf, words, word-embedding

Python TF-IDF algorithm


I would like to find the most relevant words across a set of documents.

I would like to run a TF-IDF algorithm over 3 documents and produce a CSV file containing each word and its TF-IDF score.

After that, I will keep only the words with high scores and use them.

I found this implementation that does what I need: https://github.com/mccurdyc/tf-idf/.

I call that jar using the subprocess library. But there is a huge problem in that code: it makes a lot of mistakes when analyzing words. It merges some words together, and it seems to have problems with ' and -. I am running it over the text of 3 books (Harry Potter) and, for example, I am getting words such as hermiones, hermionell, riddlehermione, and thinghermione in the CSV file instead of just hermione.
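For comparison, a tokenizer that understands punctuation keeps these words apart. A quick check with NLTK's word_tokenize (just to show the behaviour I would expect):

    from nltk import word_tokenize
    # word_tokenize splits off punctuation instead of gluing words together
    print(word_tokenize("Hermione's wand. Riddle-Hermione?"))
    # ['Hermione', "'s", 'wand', '.', 'Riddle-Hermione', '?']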

Am I doing something wrong? Can you give me a working implementation of the TF-IDF algorithm? Is there a Python library that does that?


Solution

  • Here is an implementation of the TF-IDF algorithm using scikit-learn. Before applying it, you can word_tokenize() and stem your words with NLTK.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from nltk import word_tokenize
    from nltk.stem.porter import PorterStemmer
    
    # note: word_tokenize needs the NLTK "punkt" data (nltk.download('punkt'))
    stemmer = PorterStemmer()
    
    def tokenize(text):
        # split the text into tokens and reduce each token to its stem
        tokens = word_tokenize(text)
        return [stemmer.stem(item) for item in tokens]
    
    # your corpus
    text = ["This is your first text book", "This is the third text for analysis", "This is another text"]
    # word tokenize and stem
    text = [" ".join(tokenize(txt.lower())) for txt in text]
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(text).toarray()
    # turn the matrix into a pandas DataFrame with one column per word
    # (use get_feature_names() instead on scikit-learn < 1.0)
    matrix = pd.DataFrame(matrix, columns=vectorizer.get_feature_names_out())
    # sum each word's TF-IDF score over all documents (axis=0)
    # and sort so the most relevant words come first
    top_words = matrix.sum(axis=0).sort_values(ascending=False)
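
    Since the question asks for a CSV file, the summed scores can be written straight out with pandas; the filename below is just a placeholder. Passing stop_words="english" to TfidfVectorizer also helps keep very common words from crowding the top of the list.

    # write one word per row with its summed TF-IDF score ("tfidf.csv" is an arbitrary name)
    top_words.to_csv("tfidf.csv", header=["score"])
    # quick look at the most relevant words
    print(top_words.head(10))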