python machine-learning scikit-learn classification

Extracting overlapping categories through machine learning

I am trying to extract product attributes (categories) that may overlap.

Given the title, manufacturer, and description, I need to know whether the product is jeans or something else and, furthermore, whether it is skinny jeans or another type of jeans. Going through the scikit-learn exercises, it seems I can only predict one category at a time, which doesn't fit my case. Any suggestions on how to tackle the problem?

What I have in mind right now is to have training data for each category, e.g.:

jeans = ['desc of jeans 1', 'desc of jeans 2']
skinny_jeans = ['desc of skinny jeans 1', 'desc of skinny jeans 2']

With this training data, I would then ask for the probabilities for a given unknown product and expect an answer like this, with a match percentage per category:

Unknown_Product_1 = {
    'jeans': 93,
    'skinny_jeans': 80,
    't-shirt': 5
}

Am I way off base? If this is the right path to take, how do I achieve it?

Solution

You are probably describing a task called multi-label learning or multi-label classification, in which each item can be assigned several labels at once (here, both "jeans" and "skinny jeans").
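A simple baseline for this is to train one binary classifier per label and read off a separate probability for each, which gives output in roughly the shape the question asks for. The following is only a minimal sketch, assuming scikit-learn is installed; the toy product texts, the label sets, and the choice of TF-IDF features with logistic regression are all made up for illustration and are not the only reasonable setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: each product text carries a *set* of labels.
texts = [
    "slim stretch skinny jeans in dark blue denim",
    "classic straight leg denim jeans",
    "super skinny ripped jeans black denim",
    "relaxed fit bootcut jeans",
    "cotton crew neck t-shirt white",
    "graphic print short sleeve t-shirt",
]
labels = [
    {"jeans", "skinny_jeans"},
    {"jeans"},
    {"jeans", "skinny_jeans"},
    {"jeans"},
    {"t-shirt"},
    {"t-shirt"},
]

# Turn the label sets into a binary indicator matrix, one column per label.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# One independent logistic regression per label on top of TF-IDF features.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression()),
)
model.fit(texts, Y)

# Per-label probabilities for an unseen product; they need not sum to 1.
proba = model.predict_proba(["black skinny jeans with stretch"])[0]
scores = {label: int(round(100 * float(p))) for label, p in zip(mlb.classes_, proba)}
print(scores)
```

With only six training texts the percentages are not meaningful; a real data set would combine title, manufacturer, and description into the text field and use far more examples per label.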

A key difference from standard classification is that by learning the relationships between labels (for example, that skinny jeans are always jeans), you can sometimes obtain better performance than by training many independent standard classifiers.
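One way to let a model use relationships between labels is a classifier chain, where each label's classifier also sees the predictions for the labels earlier in the chain. The sketch below assumes scikit-learn's `ClassifierChain`; the toy data, the label order (jeans first, so the skinny-jeans classifier can use the jeans prediction), and the base estimator are illustrative assumptions, and whether a chain actually beats independent classifiers depends on the data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data, invented for illustration.
texts = [
    "slim stretch skinny jeans in dark blue denim",
    "classic straight leg denim jeans",
    "super skinny ripped jeans black denim",
    "relaxed fit bootcut jeans",
    "cotton crew neck t-shirt white",
    "graphic print short sleeve t-shirt",
]
labels = [
    {"jeans", "skinny_jeans"},
    {"jeans"},
    {"jeans", "skinny_jeans"},
    {"jeans"},
    {"t-shirt"},
    {"t-shirt"},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # columns sorted: jeans, skinny_jeans, t-shirt

# order=[0, 1, 2]: predict "jeans" first, then feed that prediction as an
# extra feature to the "skinny_jeans" classifier, then both to "t-shirt".
chain = make_pipeline(
    TfidfVectorizer(),
    ClassifierChain(LogisticRegression(), order=[0, 1, 2]),
)
chain.fit(texts, Y)

chain_proba = chain.predict_proba(["black skinny jeans with stretch"])[0]
chain_scores = {
    label: int(round(100 * float(p))) for label, p in zip(mlb.classes_, chain_proba)
}
print(chain_scores)
```

Comparing this against independent per-label classifiers on held-out data is the only reliable way to tell whether modelling the label dependencies helps for a given catalogue.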