Tags: matlab, machine-learning, data-mining, classification, document-classification

Classify a cell array in MATLAB


I want to do text categorization on a dataset of news articles. Each article has many features, such as subject, keyword, and summary. All of these features are stored in one cell array of structs, where each struct looks like this:

       label: 'misc.forsale'
        subj: ' Motorcycle wanted.'
     keyword: [1x190 char]
   reference: []
organization: ' Worcester Polytechnic Institute'
        from: ' [email protected] (John Kedziora)'
     summary: []
       lines: ' 11'
       vocab: [4x2 double]

I want to classify them with class = classify(test, train, target, 'diaglinear');
but classify and similar functions only accept numeric arrays as input, not cell arrays or structs.

I can't convert this cell array to a single multidimensional array because the number of features varies from struct to struct (for example, one subject has two words and another has three).

What can I do?


Solution

  • Do some feature extraction first. For example, tokenize the strings, then use TF-IDF.

    You can include the key (the field name) with the tokens. This is a common practice in information retrieval. See the Xapian manual for an example.

    Usually, you will do some stemming first, e.g. Examples -> exampl. Then add a prefix so that the same word counts as a different term depending on the field it occurred in, e.g. Sexampl when the subject contained "example" and Kexampl when it was a keyword.

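    Then you have a "bag of words" representation, which is used widely in text mining. Each document becomes a fixed-length vector with one entry per term in the vocabulary, so the whole dataset fits in an ordinary numeric matrix no matter how many words each field contains. The same idea is even used for mining images, where it is called "visual words"; those aren't English-language words either.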
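
The steps above can be sketched roughly as follows in base MATLAB. This is only a minimal illustration under several assumptions: the three toy documents, the choice of the subj and keyword fields, the S and K prefixes, and the plain log(N/df) weighting are all made up for the example, and stemming is skipped entirely.

```matlab
% Toy data in the same shape as the question: a cell array of structs.
% (Hypothetical documents, only for illustration.)
docs = { ...
    struct('label', 'misc.forsale', 'subj', ' Motorcycle wanted.',  'keyword', 'honda bike'), ...
    struct('label', 'rec.autos',    'subj', ' Car engine question', 'keyword', 'engine'), ...
    struct('label', 'misc.forsale', 'subj', ' Bike for sale',       'keyword', []) };

fields   = {'subj', 'keyword'};   % fields to turn into tokens (assumed choice)
prefixes = {'S', 'K'};            % one prefix per field (assumed choice)

% 1) Tokenize each field and tag every token with its field prefix.
nDocs  = numel(docs);
tokens = cell(nDocs, 1);
for i = 1:nDocs
    docTokens = {};
    for f = 1:numel(fields)
        str = docs{i}.(fields{f});
        if isempty(str), continue; end            % empty fields add no tokens
        words  = regexp(lower(str), '[a-z0-9]+', 'match');
        tagged = cellfun(@(w) [prefixes{f} w], words, 'UniformOutput', false);
        docTokens = [docTokens, tagged];          %#ok<AGROW>
    end
    tokens{i} = docTokens;
end

% 2) Build the vocabulary and the term-count matrix (documents x terms).
vocab = unique([tokens{:}]);
tf    = zeros(nDocs, numel(vocab));
for i = 1:nDocs
    [~, loc] = ismember(tokens{i}, vocab);
    for k = loc
        tf(i, k) = tf(i, k) + 1;
    end
end

% 3) Weight the counts with a simple TF-IDF variant: tf * log(N / df).
df    = sum(tf > 0, 1);                 % number of documents containing each term
idf   = log(nDocs ./ df);
tfidf = bsxfun(@times, tf, idf);        % fixed-size numeric matrix

% 4) The matrix can now be passed to classify, for example
%    (trainRows and testRows are hypothetical index vectors):
% labels = cellfun(@(d) d.label, docs, 'UniformOutput', false);
% class  = classify(tfidf(testRows, :), tfidf(trainRows, :), ...
%                   labels(trainRows), 'diaglinear');
```

Because of the prefixes, "bike" in the subject (Sbike) and "bike" in the keywords (Kbike) end up in different columns. On real data the vocabulary should be built from the training documents only, and columns that are constant within the training set may need to be dropped before calling classify.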