svm libsvm feature-extraction data-representation

Data representation for SVM


I have a million files containing free text. Each file has been assigned one or more codes, which can be treated as categories. I have normalized the text by removing stop words. I am using scikit-learn's libsvm implementation to train a model that predicts the right code(s) (categories) for each file.

I have read and searched a lot, but I couldn't work out how to represent my textual data as numbers, since SVMs and most machine learning tools learn from numerical values.

I think I would need to compute tf-idf for each term in the whole corpus, but I am still not sure how that would help me convert my textual data into libsvm format.

Any help would be greatly appreciated. Thank you.


Solution

  • You are not forced to use tf-idf.

    To begin with, follow this simple approach:

    1. Collect all distinct words across all your documents. This will be your vocabulary. Save it in a file.
    2. For each word in a given document, replace it with the index of that word in your vocabulary file.
    3. Next to each index, add the number of times the word appears in the document.

    Example:

    I have two documents (stop words removed, stemmed):

    hello world

    and

    hello sky sunny hello

    Step 1: I generate the following vocabulary

    hello
    sky
    sunny
    world
    

    Step 2:

    I can represent my documents like this:

    1 4

    (because the word hello is in position 1 in the vocabulary and the word world is in position 4) and

    1 2 3 1


    Step 3: I add the term frequency next to each term and remove duplicates:

    1:1 4:1

    (because the word hello appears 1 time in the document, and the word world appears 1 time)

    and

    1:2 2:1 3:1


    If you add the class number in front of each line, you have a file in libsvm format:

    1 1:1 4:1
    2,3 1:2 2:1 3:1 
    

    Here the first document has class 1, and the second document has classes 2 and 3.
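
The three steps above can be sketched in plain Python. This is only a minimal illustration, assuming whitespace-separated tokens that are already normalized; the helper names `build_vocabulary` and `to_libsvm_line` are made up for this example and are not part of libsvm or scikit-learn.

```python
from collections import Counter


def build_vocabulary(documents):
    """Step 1: map each distinct word to a 1-based index (sorted order)."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(words, start=1)}


def to_libsvm_line(labels, document, vocabulary):
    """Steps 2 and 3: replace words by indices and attach their counts."""
    counts = Counter(vocabulary[word] for word in document.split())
    # Feature indices are written in ascending order, one index:count pair each.
    features = " ".join(f"{index}:{count}" for index, count in sorted(counts.items()))
    # Several classes for one document are joined with commas, as in the example.
    label_field = ",".join(str(label) for label in labels)
    return f"{label_field} {features}"


documents = ["hello world", "hello sky sunny hello"]
labels = [[1], [2, 3]]  # class labels taken from the example above

vocabulary = build_vocabulary(documents)
lines = [to_libsvm_line(l, d, vocabulary) for l, d in zip(labels, documents)]
print("\n".join(lines))
```

For a million files you would probably build the vocabulary in one pass over the corpus and write the lines in a second pass, rather than holding every document in memory.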

    In this example each word is associated with its term frequency. To use tf-idf, do the same but replace the term frequency with the computed tf-idf value.
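
As a rough sketch of the tf-idf variant, the snippet below replaces each count with tf × idf. The answer does not say which idf formula to use, so the smoothed form `ln((1 + N) / (1 + df)) + 1` is an assumption here (it is one common choice), and no length normalization is applied. The function name `tfidf_libsvm_lines` is hypothetical.

```python
import math
from collections import Counter


def tfidf_libsvm_lines(documents, labels):
    """Return libsvm-style lines whose feature values are tf-idf weights."""
    tokenized = [doc.split() for doc in documents]
    distinct = sorted({word for doc in tokenized for word in doc})
    vocabulary = {word: index for index, word in enumerate(distinct, start=1)}

    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in tokenized for word in set(doc))
    # Assumed smoothed idf; other definitions are equally legitimate.
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

    lines = []
    for doc_labels, doc in zip(labels, tokenized):
        tf = Counter(doc)
        weights = {vocabulary[word]: tf[word] * idf[word] for word in tf}
        features = " ".join(f"{i}:{weights[i]:.4f}" for i in sorted(weights))
        lines.append(",".join(map(str, doc_labels)) + " " + features)
    return lines


tfidf_lines = tfidf_libsvm_lines(
    ["hello world", "hello sky sunny hello"],
    [[1], [2, 3]],
)
print("\n".join(tfidf_lines))
```

With this formula a word that appears in every document, such as hello, keeps an idf of 1, so its weight equals its raw count, while rarer words get a larger weight.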