For my assignment I need to make a machine learning program which does the following:
As input the program gets a building plan (written as text in a PDF) for a project, mainly bridges and sluices. The program takes every sentence in that PDF as a sample (the words in that sentence are the features) and needs to classify every sample/sentence into one of the following categories: hardware related and software related. (I use the Naive Bayes algorithm in combination with TF-IDF.)
However, as you can imagine, there are also a lot of irrelevant sentences which are neither hardware nor software related. Do I need to make a separate category 'Default/Irrelevant', so that I have three categories in total? Or is it better to keep only the two categories and classify sentences based on their probability? For example: if a sentence is classified as hardware at 0.6, I ignore it, but if the outcome is 0.8 or higher, I classify it as hardware.
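Concretely, the thresholding idea would look something like this (a minimal sketch assuming scikit-learn; the example sentences and the 0.8 cut-off are only illustrative):

    # Two-class TF-IDF + Naive Bayes pipeline with a probability threshold.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_sentences = [
        "mount the sensors on the bridge deck",           # hardware
        "the control software logs every bridge opening"  # software
    ]
    train_labels = ["hardware", "software"]

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train_sentences, train_labels)

    sentence = "the contractor submits the planning in week 12"
    probs = dict(zip(model.classes_, model.predict_proba([sentence])[0]))
    best = max(probs, key=probs.get)
    # Accept the prediction only when the classifier is confident enough.
    label = best if probs[best] >= 0.8 else "irrelevant"
    print(label, probs)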
You need to include irrelevant sentences in your training set. I will explain the reason with one example:
If you have a three-class classification problem, you can obtain this output: Irrelevant 95%, Hardware 4%, Software 1%.
The probability of being Hardware is 4 times the probability of being Software, but you will obviously choose Irrelevant.
If you use a two-class dataset, you will obtain this output: Hardware 80%, Software 20%.
The probability of being Hardware is again 4 times the probability of being Software, but now both percentages must sum to 100%, because the classifier treats those two classes as the whole universe (the 4% and 1% are simply renormalized: 4% / (4% + 1%) = 80%).
You have two different options:
1 - A 3-class classification problem (Hardware, Software, Irrelevant)
2 - Two classifiers, each with a 2-class classification problem:
    Classifier 1 -> Positive class: Hardware, Negative class: Software + Irrelevant
    Classifier 2 -> Positive class: Software, Negative class: Hardware + Irrelevant
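For option 1, a minimal sketch of the 3-class setup (assuming scikit-learn; the training sentences and labels below are only placeholders):

    # Option 1: one 3-class model, TF-IDF features + multinomial Naive Bayes.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_sentences = [
        "mount the PLC cabinet next to the sluice gate",  # hardware
        "the SCADA software controls the sluice gates",   # software
        "the meeting is planned for next month",          # irrelevant
    ]
    train_labels = ["hardware", "software", "irrelevant"]

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train_sentences, train_labels)

    # predict_proba now spreads the probability mass over all three classes,
    # so an irrelevant sentence does not have to end up as hardware or software.
    new_sentence = ["the contractor delivers the plan in week 12"]
    print(dict(zip(model.classes_, model.predict_proba(new_sentence)[0])))

Option 2 amounts to a one-vs-rest style setup (each binary classifier can be built the same way, with the negative class containing the other two categories), but with Naive Bayes the single 3-class model is usually the simpler starting point.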