This is a basic question. I am trying to classify text files into 20 different classes.
My project has two folders, train and test. The train folder contains 20 subfolders, one per class (e.g. weather, atheism, etc.), and each subfolder holds many files belonging to that class.
I have now created a train.arff file for the entire train folder. When the data is visualized, I can see only two attributes. I have provided a link below:
My question is: how can I view the individual files under these folders and remove stopwords and punctuation and apply stemming? How do I go about preprocessing? If there are good resources available, please suggest them and provide the links.
I found the videos below quite helpful when I first got my hands on text classification using Weka. You might want to take a look.
You might want to use the StringToWordVector filter, which turns each word into its own attribute so that you can see its effect; this is described in detail in the first and last videos. In the filter settings you can supply a stopword list and choose on each run whether to use it. The stemmer can be changed in the same way. This documentation and the videos should make it easy to understand.
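To make the idea concrete, here is a rough sketch in plain Python of what a word-vector filter does conceptually: each document becomes a row of word counts, with stopword removal and stemming as switchable options. This is not Weka's implementation. The stopword list, the `crude_stem` helper and the sample sentences are all made up for illustration, and the real filter has many more options.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list; a real list is much longer.
STOPWORDS = {"the", "is", "a", "of", "and", "in", "to"}


def crude_stem(word):
    """Toy suffix stripper, a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text, use_stopwords=True, stem=True):
    # Lowercase and keep only runs of letters, which drops punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    if use_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


def to_word_vectors(documents, **options):
    """Return (vocabulary, rows): one count per vocabulary word per document."""
    counts = [Counter(tokenize(d, **options)) for d in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


docs = [
    "The storm is bringing rain and winds.",
    "Rain is expected in the evening.",
]
vocabulary, rows = to_word_vectors(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Calling `to_word_vectors(docs, use_stopwords=False, stem=False)` shows how the attribute list changes when both options are switched off, which mirrors toggling the same settings in the filter between runs.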