svm, libsvm, supervised-learning

Sample data with too many dimensions in SVM


I am working with training and test data made up of Google search snippets.

The training data consists of 10,060 snippets, one per line. Each snippet is a list of words/terms followed by a class label at the end of the line.

There are 8 class labels:

Business, Computers, Culture-Arts-Entertainment, Education-Science, Engineering, Health, Politics-Society, Sports

The following are some of the lines in the dataset:

manufacture manufacturer directory directory china taiwan products manufacturers directory- taiwan china products manufacturer direcory exporter directory supplier directory suppliers business

empmag electronics manufacturing procurement homepage electronics manufacturing procurement magazine procrement power products production essentials data management business

dfma truecost paper true cost overseas manufacture product design costs manufacturing products china manufacturing redesigned product china save business

As you can see, the snippets contain different numbers of terms, but every sample must have the same number of dimensions before it can be used with SVM.

I am thinking of using 1 to indicate that a word occurs in a specific row and 0 otherwise, so each row becomes a 0/1 vector with one dimension per distinct word. However, that produces too many dimensions.
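For illustration, the 0/1 encoding does not have to be stored as a dense vector. The sketch below assumes whitespace tokenization, the label as the last token on each line (as in the sample lines), and shortened toy snippets; the function names and the `label_ids` mapping are hypothetical. It writes each snippet in the sparse `label index:value` text format that libsvm reads, listing only the terms that are present:

```python
def split_snippet(line):
    """Split a line into (terms, label); the label is assumed to be the last token."""
    tokens = line.split()
    return tokens[:-1], tokens[-1]


def build_vocabulary(lines):
    """Map each distinct term to a 1-based feature index (libsvm indices start at 1)."""
    vocab = {}
    for line in lines:
        terms, _ = split_snippet(line)
        for term in terms:
            if term not in vocab:
                vocab[term] = len(vocab) + 1
    return vocab


def to_libsvm_rows(lines, vocab, label_ids):
    """Encode each snippet as 'label idx:1 idx:1 ...' with indices in ascending order."""
    rows = []
    for line in lines:
        terms, label = split_snippet(line)
        # A set drops repeated terms, which is what a 0/1 encoding implies.
        indices = sorted({vocab[t] for t in terms if t in vocab})
        features = " ".join(f"{i}:1" for i in indices)
        rows.append(f"{label_ids[label]} {features}")
    return rows


# Shortened toy snippets, not the real dataset.
snippets = [
    "manufacture manufacturer directory china taiwan products business",
    "electronics manufacturing procurement magazine power products business",
    "football league results scores sports",
]
vocab = build_vocabulary(snippets)
label_ids = {"business": 1, "sports": 2}  # hypothetical numeric class ids
rows = to_libsvm_rows(snippets, vocab, label_ids)
for row in rows:
    print(row)
```

With this representation the storage per snippet grows with the number of terms it contains, not with the size of the vocabulary.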

My question: Are there other ways to preprocess the data so that SVM can be run efficiently?


Solution

  • You should look into term weighting and feature selection before performing text classification with SVM.

    The default approach would be:

    1. Apply tfc term weighting. This multiplies the frequency of a term in the current document by the term's inverse document frequency, then normalizes each document vector to unit length.

    2. Apply information-gain-based feature selection to keep only the most discriminative terms.

    3. Transform your documents using the weights from step 1 and the terms selected in step 2.

    4. Perform text-classification with SVM.
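Steps 1 to 3 can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: it assumes documents are already tokenized into lists of terms, uses the common `log(N / df)` variant of inverse document frequency, and measures information gain on term presence versus absence. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` occurs in a document."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without = [y for d, y in zip(docs, labels) if term not in d]
    gain = entropy(Counter(labels).values())
    for part in (with_term, without):
        if part:
            gain -= len(part) / len(labels) * entropy(Counter(part).values())
    return gain


def select_terms(docs, labels, k):
    """Step 2: keep the k terms with the highest information gain."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]


def tfc_vectors(docs, terms):
    """Steps 1 and 3: tf * idf per selected term, then cosine normalization."""
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    vectors = []
    for d in docs:
        tf = Counter(d)
        raw = [tf[t] * math.log(n / df[t]) if df[t] else 0.0 for t in terms]
        norm = math.sqrt(sum(w * w for w in raw))
        vectors.append([w / norm if norm else 0.0 for w in raw])
    return vectors


# Toy data for illustration only.
docs = [
    ["china", "products", "manufacturing", "products"],
    ["electronics", "manufacturing", "products"],
    ["football", "league", "scores"],
    ["league", "results", "football"],
]
labels = ["business", "business", "sports", "sports"]

selected = select_terms(docs, labels, k=4)
vectors = tfc_vectors(docs, selected)
```

The resulting vectors have one dimension per selected term instead of one per vocabulary word, and they can then be passed to the SVM implementation of your choice for step 4.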

    I recommend the following publications for further reading. In these publications you will find the typical approaches used for SVM-based text classification in the research community: