I'm working on a personal project to build a text classifier. I scraped around 3000 news articles from 8 categories. I have every single word in every article with its article's category tag in a dataframe.
The answers I saw online referred to using tfidf on entire articles/text blocks. Is there any way to analyze individual words?
Here is an idea of what my data currently looks like:
Word: Category:
Mobile Science/tech
Phone Science/tech
Google Science/tech
Facebook Science/tech
Implant Science/tech
Interest Business/economy
Bank Business/economy
IMF Business/economy
Downturn Business/economy
President Politics
Donald Politics
Trump Politics
etc... etc...
I apologize for the horrible formatting; I'm somewhat new to this.
TF-IDF can't be applied to individual words in isolation: it is always the score of a word within a text, relative to a corpus. Since you're asking this, I suspect TF-IDF itself isn't entirely clear to you yet, so I'll try to explain it.
TF-IDF is a way to calculate a "score" or "weight" for a word in a text, relative to a corpus (a set of texts). The score reflects how important the word is in the text it appears in. So, for each text in which a given word occurs, you get a separate score.
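The first part of TF-IDF is TF (term frequency). In its most common form:
TF(word, text) = (number of times the word occurs in the text) / (total number of words in the text)
The second part is IDF (inverse document frequency). In its most common form:
IDF(word) = log((number of texts in the corpus) / (number of texts containing the word))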
By multiplying those two coefficients, you get the "importance" of a word in a text, relative to the corpus.
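To make the multiplication concrete, here is a minimal sketch in plain Python. It assumes the common textbook definitions (TF as a relative count, IDF as the log of the number of texts divided by the number of texts containing the word); libraries often use smoothed variants. The toy corpus and the function names are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each text is a list of lowercase words.
corpus = {
    "tech_article": ["mobile", "phone", "google", "mobile"],
    "business_article": ["bank", "interest", "mobile", "downturn"],
    "politics_article": ["president", "trump", "bank", "president"],
}

def tf(word, words):
    """Term frequency: share of the words in one text that are `word`."""
    return Counter(words)[word] / len(words)

def idf(word, corpus):
    """Inverse document frequency: log(N / number of texts containing `word`)."""
    containing = sum(1 for words in corpus.values() if word in words)
    if containing == 0:
        return 0.0  # word never seen in the corpus: no information
    return math.log(len(corpus) / containing)

def tfidf(word, name, corpus):
    """Score of `word` in the text called `name`, relative to the corpus."""
    return tf(word, corpus[name]) * idf(word, corpus)

# The same word gets a different score in each text.
for name in corpus:
    print(name, round(tfidf("mobile", name, corpus), 3))
```

Note that the score is attached to a (word, text) pair, which is why a table of isolated words can't be scored on its own.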
Here's an example. If the word "Mobile" occurs in two texts, one about Business (say, mobile phone sales) and the other about Tech, you'll have two scores for "Mobile" in the corpus, one per text. When you encounter this word in an unknown article, you can sum, for each category, the scores of all the words from the unknown article, and the category with the highest total tells you, pretty accurately, what the unknown article is about.
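The summing idea can be sketched as follows, again in plain Python. This is only an illustration under some assumptions: the (word, category) rows are invented to resemble the table in the question, each category is treated as one big text (a simplification; keeping one text per article would be closer to normal TF-IDF usage), and `build_scores` and `classify` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict

# Hypothetical (word, category) rows, shaped like the table in the question.
rows = [
    ("mobile", "Science/tech"), ("phone", "Science/tech"),
    ("google", "Science/tech"), ("mobile", "Science/tech"),
    ("interest", "Business/economy"), ("bank", "Business/economy"),
    ("mobile", "Business/economy"), ("downturn", "Business/economy"),
    ("president", "Politics"), ("trump", "Politics"), ("bank", "Politics"),
]

# Assumption: rebuild one bag of words per category and treat it as one text.
docs = defaultdict(list)
for word, category in rows:
    docs[category].append(word)

def build_scores(docs):
    """Return {category: {word: tf-idf score}} for every word in every text."""
    n_docs = len(docs)
    # Number of texts each word appears in (counted once per text).
    doc_freq = Counter(w for words in docs.values() for w in set(words))
    scores = {}
    for category, words in docs.items():
        counts = Counter(words)
        scores[category] = {
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
    return scores

def classify(article_words, scores):
    """Sum the per-category scores of the article's words; highest total wins."""
    totals = {
        category: sum(word_scores.get(w, 0.0) for w in article_words)
        for category, word_scores in scores.items()
    }
    return max(totals, key=totals.get), totals

scores = build_scores(docs)
label, totals = classify(["google", "mobile", "phone"], scores)
print(label, totals)
```

For a real project, a library implementation such as scikit-learn's `TfidfVectorizer` applied to whole articles, followed by a standard classifier, is likely to work better than this hand-rolled scoring.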