
Does the IBM Watson Natural Language Classifier support multiple classes and multiple class sets?


I'm trying to solve the following with the IBM Watson Natural Language Classifier on IBM Bluemix:

I have N training documents D, each labeled with labels l_x_y drawn from different label sets S_1 to S_n, where x identifies the label set and y the actual label within that set. Each document can carry multiple labels (coming from different label sets).

Here is an example:

Label Set 1: S_1 = {a,b,c,d,e,f}
Label Set 2: S_2 = {1,2,3,4,5,6}

D_1 = "This is some text", {a,c,e,1,3,4}
D_2 = "This is some text2", {d,f,4}

If I understood correctly, the REST service can be trained with multiple classes. The naive approach would be to train a separate classifier for each label set.

But is there a better way to do this? For example, can I use the union of the labels of all sets (as illustrated in D_1 and D_2)?

I ask because the API documentation says the following about the response:

An array [Classes] of up to ten class_name-confidence pairs that are sorted in descending order of confidence. If there are fewer than 10 classes, the sum of the confidence values is 100%.

So if the union of all label sets contains more than 10 labels, the response might omit low-confidence classes. Is there any other issue with using the union of the label sets?


Solution

  • The training data format treats every column after the text column as a class label. If you send the training data like this (your case):

    "This is some text", "{a,c,e,1,3,4}"

    "This is some text2", "{d,f,4}"

    then the service assumes the training data contains two unique classes: {a,c,e,1,3,4} and {d,f,4}.

    However, you can try training on multiple labels by creating training data like this:

    "This is some text", a,c,e,1,3,4

    "This is some text2", d,f,4

    in which case you are training on 8 unique classes (a, c, d, e, f, 1, 3, 4). The classification output will then contain confidence values for these classes. It is up to you to sort the returned classes back into their respective label sets.
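As a rough illustration of the two steps above, here is a minimal Python sketch that writes the one-label-per-column training rows and then sorts a returned class list back into label sets. It does not call the service. The names `LABEL_SETS`, `to_training_csv`, and `group_by_label_set` are made up for this example, the sample response values are invented, and the `class_name`/`confidence` keys are assumed from the documentation excerpt quoted in the question. It also assumes the label sets are disjoint, as they are in the question's example.

```python
import csv
import io

# Hypothetical label sets, copied from the question's example.
LABEL_SETS = {
    "S_1": {"a", "b", "c", "d", "e", "f"},
    "S_2": {"1", "2", "3", "4", "5", "6"},
}


def to_training_csv(documents):
    """Write one row per document: the text, then one column per label."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for text, labels in documents:
        writer.writerow([text, *labels])
    return buffer.getvalue()


def group_by_label_set(classes, label_sets=LABEL_SETS):
    """Split a flat list of class/confidence entries into per-set lists."""
    grouped = {name: [] for name in label_sets}
    for entry in classes:
        for name, labels in label_sets.items():
            if entry["class_name"] in labels:
                grouped[name].append((entry["class_name"], entry["confidence"]))
    return grouped


documents = [
    ("This is some text", ["a", "c", "e", "1", "3", "4"]),
    ("This is some text2", ["d", "f", "4"]),
]
training_csv = to_training_csv(documents)

# Invented response fragment; the key names are an assumption (see above).
sample_classes = [
    {"class_name": "a", "confidence": 0.6},
    {"class_name": "4", "confidence": 0.3},
    {"class_name": "d", "confidence": 0.1},
]
grouped = group_by_label_set(sample_classes)
```

Because the confidences are reported over the union of all classes, the values inside one group will generally not sum to 100%; if you need per-set scores you would have to renormalize within each group yourself.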