Search code examples
machine-learningtext-classificationsupervised-learningmulticlass-classification

Searching for list of terms using Google in order to build a bag-of-words for a particular category


I am having a hard time understanding the process of building a bag-of-words. This will be a multiclass classfication supervised machine learning problem wherein a webpage or a piece of text is assigned to one category from multiple pre-defined categories. Now the method that I am familiar with when building a bag of words for a specific category (for example, 'Math') is to collect a lot of webpages that are related to Math. From there, I would perform some data processing (such as remove stop words and performing TF-IDF) to obtain the bag-of-words for the category 'Math'.

Question: Another method that I am thinking of is to instead search in google for something like 'List of terms related to Math' to build my bag-of-words. I would like to ask if this is method is okay?

Another question: In the context of this question, does bag-of-words and corpus mean the same thing?

Thank you in advance!


Solution

  • This is not what bag of words is. Bag of words is the term to describe a specific way of representing a given document. Namely, a document (paragraph, sentence, webpage) is represented as a mapping of form

    word: how many times this word is present in a document
    

    for example "John likes cats and likes dogs" would be represented as: {john: 1, likes: 2, cats: 1, and: 1, dogs: 1}. This kind of representation can be easily fed into typical ML methods (especially if one assumes that total vocabulary is finite so we end up with numeric vectors).

    Note, that this is not about "creating a bag of words for a category". Category, in typical supervised learning would consist of multiple documents, and each of them independently is represented as a bag of words.

    In particular this invalidates your final proposal of asking google for words that are related to category - this is not how typical ML methods work. You get a lot of documents, represent them as bag of words (or something else) and then perform statistical analysis (build a model) to figure out the best set of rules to discriminate between categories. These rules usually will not be simply "if the word X is present, this is related to Y".