I have a million files containing free text. Each file has been assigned one or more codes, which can be treated as categories. I have normalized the text by removing stop words. I am using scikit-learn's libsvm-based SVM to train a model that predicts the right code(s) (categories) for each file.
I have read and searched a lot, but I still don't understand how to represent my textual data numerically, since SVMs and most machine learning tools require numerical input.
I think I need to compute tf-idf for each term over the whole corpus, but I am still not sure how that helps me convert my textual data into libsvm format.
Any help would be greatly appreciated. Thank you.
You are not forced to use tf-idf.
To begin with, follow this simple approach.
Suppose I have two documents (stop words removed, stemmed):
hello world
and
hello sky sunny hello
Step 1: I generate the following vocabulary:
hello
sky
sunny
world
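If it helps, here is a minimal Python sketch of this step, assuming whitespace tokenization on the two example documents (the variable names are mine, not from any library):

```python
# Build a 1-based vocabulary (term -> position) from the example documents.
docs = ["hello world", "hello sky sunny hello"]

terms = sorted({t for d in docs for t in d.split()})
vocab = {term: i + 1 for i, term in enumerate(terms)}  # 1-based, as above

print(vocab)  # {'hello': 1, 'sky': 2, 'sunny': 3, 'world': 4}
```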
Step 2:
I can represent my documents like this:
1 4
(because the word hello is in position 1 in the vocabulary and the word world is in position 4) and
1 2 3 1
Step 3: I add the term frequency next to each term's index and remove duplicates:
1:1 4:1
(because the word hello appears once in the document, and the word world appears once)
and
1:2 2:1 3:1
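A sketch of Steps 2 and 3 together, reusing the vocabulary built above (the helper name to_libsvm_features is hypothetical); note that libsvm expects feature indices in ascending order:

```python
from collections import Counter

vocab = {"hello": 1, "sky": 2, "sunny": 3, "world": 4}

def to_libsvm_features(doc):
    # Count term frequencies, then emit "index:tf" pairs sorted by index.
    counts = Counter(doc.split())
    return " ".join(f"{idx}:{counts[t]}"
                    for t, idx in sorted(vocab.items(), key=lambda kv: kv[1])
                    if t in counts)

print(to_libsvm_features("hello world"))            # 1:1 4:1
print(to_libsvm_features("hello sky sunny hello"))  # 1:2 2:1 3:1
```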
If you add the class number in front of each line, you have a file in libsvm format:
1 1:1 4:1
2,3 1:2 2:1 3:1
Here the first document has class 1, and the second document has classes 2 and 3 (the comma-separated label list is the multi-label variant of the format; plain libsvm expects a single label per line).
In this example each word is associated with its term frequency. To use tf-idf, do the same but replace each tf with the computed tf-idf value.
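Since you are already using scikit-learn, its TfidfVectorizer and dump_svmlight_file can do all of the above for you. A sketch, where the file name corpus.svm and the label lists are just placeholders for your own data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import dump_svmlight_file

docs = ["hello world", "hello sky sunny hello"]
labels = [[1], [2, 3]]  # one list of class codes per document

X = TfidfVectorizer().fit_transform(docs)  # sparse tf-idf matrix

# Writes "label(s) index:value ..." lines; multilabel=True emits "2,3"-style
# labels, and zero_based=False matches the 1-based indices used above.
dump_svmlight_file(X, labels, "corpus.svm", multilabel=True, zero_based=False)
```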