For my assignment I need to make a machine learning program which does the following:
As input the program gets a building plan (written as text in a PDF) for a project, mainly bridges and sluices. The program takes every sentence in that PDF as a sample (the words in that sentence are the features) and needs to classify every sample/sentence into one of the following categories: hardware related and software related. (I use the Naive Bayes algorithm in combination with TF-IDF.)
However, as you can imagine, there are also a lot of irrelevant sentences which are neither hardware nor software related. Do I need to make a separate category 'Default/Irrelevant', so that I have three categories in total? Or is it better to keep only the two categories and classify sentences based on their probability? For example: if a sentence is classified as hardware at 0.6, I ignore it, but if the outcome is 0.8 or higher, I classify it as hardware.
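Concretely, the thresholding idea would look something like this (a minimal sketch assuming scikit-learn; the example sentences and the 0.8 cut-off are only illustrative):

    # Two-class TF-IDF + Naive Bayes pipeline with a probability threshold.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_sentences = [
        "mount the sensors on the bridge deck",           # hardware
        "the control software logs every bridge opening"  # software
    ]
    train_labels = ["hardware", "software"]

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train_sentences, train_labels)

    sentence = "the contractor submits the planning in week 12"
    probs = dict(zip(model.classes_, model.predict_proba([sentence])[0]))
    best = max(probs, key=probs.get)
    # Accept the prediction only when the classifier is confident enough.
    label = best if probs[best] >= 0.8 else "irrelevant"
    print(label, probs)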
You need to include irrelevant sentences in your training set. I will explain the reason with one example:
If you have a three-class classification problem, you can obtain this output: Irrelevant 95%, Hardware 4%, Software 1%.
The probability of being Hardware is 4 times the probability of being Software, but you will obviously choose Irrelevant.
If you use a two-class dataset, you will obtain this output: Hardware 80%, Software 20%.
The probability of being Hardware is again 4 times the probability of being Software, but now both percentages must sum to 100%, because the classifier treats those two classes as the whole universe (the 4% and 1% are simply renormalized: 4% / (4% + 1%) = 80%).
You have two different options:
1 - A 3-class classification problem (Hardware, Software, Irrelevant)
2 - Two classifiers, each with a 2-class classification problem:
    Classifier 1 -> Positive class: Hardware, Negative class: Software + Irrelevant
    Classifier 2 -> Positive class: Software, Negative class: Hardware + Irrelevant
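For option 1, a minimal sketch of the 3-class setup (assuming scikit-learn; the training sentences and labels below are only placeholders):

    # Option 1: one 3-class model, TF-IDF features + multinomial Naive Bayes.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_sentences = [
        "mount the PLC cabinet next to the sluice gate",  # hardware
        "the SCADA software controls the sluice gates",   # software
        "the meeting is planned for next month",          # irrelevant
    ]
    train_labels = ["hardware", "software", "irrelevant"]

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train_sentences, train_labels)

    # predict_proba now spreads the probability mass over all three classes,
    # so an irrelevant sentence does not have to end up as hardware or software.
    new_sentence = ["the contractor delivers the plan in week 12"]
    print(dict(zip(model.classes_, model.predict_proba(new_sentence)[0])))

Option 2 amounts to a one-vs-rest style setup (each binary classifier can be built the same way, with the negative class containing the other two categories), but with Naive Bayes the single 3-class model is usually the simpler starting point.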