I used NLTK classifiers 2 years ago. Now I want to learn to use the Orange SVM for text classification. The SVM example in the Orange tutorial uses iris.tab:
sepal length sepal width petal length petal width iris
c c c c d
class
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
If I want to classify text, how should I prepare the data? Is it like the below?
token frequency tokenlength
the 23 3
for 21 3
at 10 2
Please give me examples of different ways of preparing the data. Can a token be used as the label in an SVM? If not, how should it be done?
Thanks very much in advance.
Short answer: No.
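Long answer: The label is the category of a document, not a token. For example, if you are trying to sort documents into two categories, such as SPAM and HAM, then the labels should be SPAM and HAM. Each row of the data file should describe one document, with its features in the columns and its label in the class column. For the feature representation you may use techniques such as Bag of Words (http://en.wikipedia.org/wiki/Bag_of_words_model), where each column holds the number of times one vocabulary word occurs in that document.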
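To make the layout concrete, here is a minimal sketch in plain Python (no Orange calls) that turns a few labelled documents into bag-of-words rows and prints them in the same three-header-line layout as the iris.tab example from the question. The toy corpus, the whitespace tokenizer, the column name "label" and the helper names are all made up for illustration, and I have not checked the output against any particular Orange version, so treat it as a starting point only.

```python
from collections import Counter

# Hypothetical toy corpus: one (text, label) pair per document.
# The label is the document's category, not a token.
documents = [
    ("win cash now", "SPAM"),
    ("cheap cash offer now", "SPAM"),
    ("meeting at noon", "HAM"),
    ("lunch meeting tomorrow", "HAM"),
]


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, split on whitespace.
    return text.lower().split()


def build_vocabulary(docs):
    # Every distinct token in the corpus becomes one feature column.
    return sorted({tok for text, _ in docs for tok in tokenize(text)})


def bag_of_words(text, vocabulary):
    # Count how often each vocabulary word occurs in this document.
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocabulary]


def to_tab(docs, class_name="label"):
    # Mirror the iris.tab layout: names, types (c/d), flags (class).
    vocab = build_vocabulary(docs)
    lines = [
        "\t".join(vocab + [class_name]),
        "\t".join(["c"] * len(vocab) + ["d"]),
        "\t".join([""] * len(vocab) + ["class"]),
    ]
    for text, label in docs:
        row = [str(n) for n in bag_of_words(text, vocab)] + [label]
        lines.append("\t".join(row))
    return "\n".join(lines)


tab_text = to_tab(documents)
print(tab_text)
```

Each data row is one document and the last column is its SPAM/HAM label, which is the opposite of the token-per-row table in the question. On a real corpus you would likely want a better tokenizer and some vocabulary pruning, since the number of columns grows with the number of distinct words.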
For further information I suggest the following: