python nlp svm orange

How do I prepare text data for training an SVM in Orange?


I used NLTK classifiers two years ago. Now I want to learn how to use Orange's SVM for text classification. The SVM example in the Orange tutorial uses iris.tab:

sepal length    sepal width    petal length    petal width    iris
c               c              c               c              d
                                                              class
5.1             3.5            1.4             0.2            Iris-setosa
4.9             3.0            1.4             0.2            Iris-setosa

If I want to classify text, how should I prepare the data? Is it like the following?

token     frequency     tokenlength

the        23             3
for        21             3
at         10             2

Please give me examples of different ways to prepare the data. Can the token be used as the label in an SVM? If not, how should I do it?

Thanks very much in advance.


Solution

  • Short answer: No.

    Long answer: The label is the category of the documents you want to classify. For example, if you are trying to sort documents into two categories, such as SPAM and HAM, then the labels should be SPAM and HAM. To represent the data, you can use techniques such as Bag of Words (http://en.wikipedia.org/wiki/Bag_of_words_model).
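As a rough illustration of the Bag of Words idea, here is a minimal sketch that uses only the Python standard library. Each row is one document (not one token), each column is the count of one vocabulary word, and the last column is the document's label. The toy corpus, the column name `label`, and the helper names `tokenize`, `to_row` and `to_tab` are all made up for this example; the three header lines simply mirror the layout of the iris.tab sample in the question, so check the Orange documentation for your version before relying on it.

```python
from collections import Counter

# Hypothetical toy corpus: (document text, label) pairs.
documents = [
    ("win money now", "SPAM"),
    ("cheap money offer", "SPAM"),
    ("meeting at noon", "HAM"),
    ("lunch at noon tomorrow", "HAM"),
]


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


# The vocabulary is every distinct token in the corpus, in sorted order.
# Each vocabulary word becomes one feature column.
vocabulary = sorted({tok for text, _ in documents for tok in tokenize(text)})


def to_row(text):
    """Turn one document into a list of per-word counts."""
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocabulary]


def to_tab(docs):
    """Build tab-separated text laid out like the iris.tab sample:
    line 1 = column names, line 2 = types (c/d), line 3 = class flag."""
    lines = [
        "\t".join(vocabulary + ["label"]),
        "\t".join(["c"] * len(vocabulary) + ["d"]),
        "\t".join([""] * len(vocabulary) + ["class"]),
    ]
    for text, label in docs:
        lines.append("\t".join([str(n) for n in to_row(text)] + [label]))
    return "\n".join(lines) + "\n"


tab_text = to_tab(documents)
print(tab_text)
```

So the token is a feature (a column), and the SPAM/HAM category is the label. Raw counts are only one possible representation; binary presence or TF-IDF weights would fit the same layout.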

    For further information I suggest the following: