I want to do text categorization on a dataset of news. I have a lot of features like subject, keyword, summary, etc. All of these features are stored in one cell array of structs; each struct looks like this:
label: 'misc.forsale'
subj: ' Motorcycle wanted.'
keyword: [1x190 char]
reference: []
organization: ' Worcester Polytechnic Institute'
from: ' [email protected] (John Kedziora)'
summary: []
lines: ' 11'
vocab: [4x2 double]
I want to classify them with class = classify(test, train, target, 'diaglinear');
but these functions only accept arrays as input, not cells or structs.
I can't convert the cell array to a single multidimensional array because the number of features varies (for example, one subject has two words and another has three).
What can I do?
Do some feature extraction first: for example, tokenize the strings, then weight the tokens with TF-IDF.
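To make the tokenize-then-TF-IDF step concrete, here is a minimal Python sketch (the question uses MATLAB, but the technique is language-independent). The regex tokenizer and the exact TF-IDF variant (raw term frequency times log of inverse document frequency) are illustrative choices, not the only ones:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on runs of letters; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    token_lists = [tokenize(d) for d in docs]
    n = len(token_lists)
    df = Counter()                       # in how many documents does each term occur?
    for toks in token_lists:
        df.update(set(toks))
    weights = []
    for toks in token_lists:
        tf = Counter(toks)               # raw term frequency within this document
        weights.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weights

docs = ["Motorcycle wanted.", "Motorcycle for sale", "Job wanted"]
w = tf_idf(docs)
# "motorcycle" occurs in 2 of 3 documents, so its idf is log(3/2)
```

A term that appears in every document gets weight log(1) = 0, which is exactly the point: ubiquitous words carry no discriminating information for classification.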
You can include the key with the tokens; this is a common practice in information retrieval. See the Xapian manual for an example.
Usually, you will do some stemming, e.g. Examples -> exampl. Now, just add a prefix to make the words distinct depending on where they occurred: e.g. Sexampl when the subject contained example, and Kexampl when it was a keyword.
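The prefixing idea can be sketched as follows (Python for brevity; the field names match the struct in the question, while the one-letter prefixes are an illustrative choice in the spirit of Xapian's conventions, and stemming is skipped to keep the sketch short):

```python
import re

def tokenize(text):
    """Lowercase and split on runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def prefixed_tokens(record):
    """Tag each token with a one-letter prefix for the field it came from,
    so 'motorcycle' in the subject stays distinct from 'motorcycle' in the keywords."""
    prefixes = {"subj": "S", "keyword": "K"}   # field -> prefix (illustrative)
    out = []
    for field, prefix in prefixes.items():
        for tok in tokenize(record.get(field) or ""):
            out.append(prefix + tok)
    return out

rec = {"subj": " Motorcycle wanted.", "keyword": "motorcycle honda"}
toks = prefixed_tokens(rec)
# ['Smotorcycle', 'Swanted', 'Kmotorcycle', 'Khonda']
```

After this step, every record is just a flat list of (prefixed) tokens, regardless of how many fields it had or how long each field was.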
Then you have a "bag of words" representation, which is used everywhere. The same idea is even used for mining images, where it is called "visual words"; those aren't English-language words either.
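The bag-of-words representation is also what resolves the variable-length problem from the question: each document, however many words it has, becomes one fixed-length count vector over a shared vocabulary, and the resulting numeric matrix is exactly the kind of array input that classifiers such as MATLAB's classify expect. A minimal Python sketch of that mapping (the helper names are mine):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(docs):
    """Map variable-length token lists onto fixed-length count vectors.
    Every document becomes one row of len(vocab) counts, so the result
    is a plain numeric matrix suitable for array-only classifiers."""
    token_lists = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in token_lists for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    matrix = []
    for toks in token_lists:
        row = [0] * len(vocab)
        for t, c in Counter(toks).items():
            row[index[t]] = c
        matrix.append(row)
    return vocab, matrix

vocab, X = bag_of_words(["Motorcycle wanted", "wanted: old motorcycle motorcycle"])
# vocab == ['motorcycle', 'old', 'wanted']; X == [[1, 0, 1], [2, 1, 1]]
```

Note that the vocabulary must be built from the training set and then reused for the test set, so that train and test rows line up column for column.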