python machine-learning text-classification tf-idf

How should I go about using TF-IDF for text classification on the data I collected?


I'm working on a personal project to build a text classifier. I scraped around 3000 news articles from 8 categories. I have every single word in every article with its article's category tag in a dataframe.

The answers I saw online referred to using tfidf on entire articles/text blocks. Is there any way to analyze individual words?

Here is an idea of what my data currently looks like:

Word:       Category:

Mobile      Science/tech
Phone       Science/tech
Google      Science/tech
Facebook    Science/tech
Implant     Science/tech
Interest    Business/economy
Bank        Business/economy
IMF         Business/economy
Downturn    Business/economy
President   Politics
Donald      Politics
Trump       Politics
etc...        etc...

I apologize for the horrible formatting; I'm somewhat new to this.


Solution

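  • TF-IDF can't be applied to individual words in isolation: a word's score only exists relative to the text it appears in and to the rest of the corpus. Since you're asking this question, I suspect TF-IDF itself isn't entirely clear to you yet.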

    I'll try to be clear about tf-idf.

    TF-IDF is a way to calculate a "score" or "weight" for each word in a text, relative to a corpus (a set of texts). The score reflects how important the word is within the text it appears in. So a given word gets one score for each text in which it occurs.

    The first part of TF-IDF is TF:

    • TF (Term Frequency) makes the score of a word grow with its usage: the more often the word appears in a text, the bigger its TF.

    The second part is IDF:

    • IDF (Inverse Document Frequency) is a second coefficient that decreases as the number of texts in the corpus containing the term increases, so a word found in almost every text gets a low weight.

    By multiplying those two coefficients, you get the "importance" of a word in a text, relative to the corpus.
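To make the multiplication concrete, here is a minimal sketch in plain Python. It assumes one common variant of the formula (TF as the word's relative frequency in the text, IDF as log(N / df), where N is the number of texts and df is the number of texts containing the word); other variants exist. The function name `tf_idf` and the toy corpus are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {word: score} dict per tokenized text in `corpus`."""
    n_docs = len(corpus)
    # Document frequency: the number of texts each word appears in.
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        scores.append({
            # TF = relative frequency in this text, IDF = log(N / df).
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical toy corpus: each text is already split into lowercase words.
corpus = [
    ["mobile", "phone", "google", "mobile"],
    ["bank", "interest", "mobile", "downturn"],
    ["president", "trump", "bank"],
]
scores = tf_idf(corpus)
print(scores[0])
```

Note that "mobile" gets a different score in each text it occurs in, and that it scores lower than "phone" in the first text even though it is used twice there, because it also appears in a second text.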

    Here's an example: if the word "Mobile" occurs in two texts, one about Business (say, the selling of mobiles) and the other about Tech, you'll have two scores for "Mobile" in the corpus, one per text. When you encounter an unknown article, you can sum, for each category, the scores of the words that article contains; the category with the highest total is a reasonable guess at what the unknown article is about.
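The summing idea can be sketched as follows. This is only an illustration of the approach described above, not a tuned classifier: the helper names `train` and `classify`, the choice to add up a word's scores across all articles of a category, and the three tiny articles are all assumptions. It also assumes the words are still grouped by article, which the word/category dataframe in the question does not provide on its own.

```python
import math
from collections import Counter, defaultdict

def train(articles):
    """articles: list of (tokens, category). Returns {category: {word: weight}}."""
    n_docs = len(articles)
    # Document frequency: the number of articles each word appears in.
    doc_freq = Counter(w for tokens, _ in articles for w in set(tokens))
    weights = defaultdict(lambda: defaultdict(float))
    for tokens, category in articles:
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            # Accumulate the word's TF-IDF score under the article's category.
            weights[category][word] += tf * idf
    return weights

def classify(tokens, weights):
    """Sum each category's weights over the article's words; return the best one."""
    totals = {
        category: sum(word_weights.get(word, 0.0) for word in tokens)
        for category, word_weights in weights.items()
    }
    return max(totals, key=totals.get)

# Hypothetical training data: (words of one article, its category).
articles = [
    (["mobile", "phone", "google", "implant"], "Science/tech"),
    (["mobile", "sales", "bank", "interest"], "Business/economy"),
    (["president", "trump", "bank", "vote"], "Politics"),
]
weights = train(articles)
predicted = classify(["google", "phone", "mobile"], weights)
print(predicted)
```

With real data, a library implementation such as scikit-learn's `TfidfVectorizer` feeding a standard classifier is likely a more robust route than hand-summing scores.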