machine-learning scikit-learn svm supervised-learning

Using SVM to predict text with label

I have data in a CSV file in the following format:

Name     Power   Money
Jon      Red     30
George   blue    20
Tom      Red     40
Bob      purple  10

I treat values such as "Jon", "Red" and "30" as inputs, and each input has a label. For instance, the inputs [Jon, George, Tom, Bob] have the label "Name", and the inputs [Red, blue, purple] have the label "Power". This is essentially my training data: a set of values, each mapped to a label.

Now I want to use an SVM to train a model on this data so that, given a new input, it identifies the correct label. For instance, if the input is "444", the model should categorize it as "Money".

I have installed Python and sklearn, and I have completed the following tutorial as well. I am just not sure how to prepare the input data to train the model.

Also, I am new to machine learning, so if I have said something that sounds wrong or odd, please point it out; I will be happy to learn the correct approach.

Solution

As your question is currently formulated, you are not dealing with a typical machine learning problem. Right now, you have column-wise data:

Name     Power   Money
Jon      Red     30
George   blue    20
Tom      Red     40
Bob      purple  10

If a user now inputs "Jon", you know it is of type "Name" by a simple hash-map lookup, e.g.:

hashmap["Jon"] -> "Name"
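As a minimal sketch of that lookup in Python, assuming the table from the question is already loaded into a dict of columns (the names `table`, `column_of`, and `lookup` are illustrative, not from the question):

```python
# Column-wise data from the question, keyed by column name.
table = {
    "Name": ["Jon", "George", "Tom", "Bob"],
    "Power": ["Red", "blue", "Red", "purple"],
    "Money": ["30", "20", "40", "10"],
}

# Invert the table: each known value maps to the column it came from.
# Values are lower-cased so that "jon" and "Jon" resolve to the same entry.
column_of = {
    value.lower(): column
    for column, values in table.items()
    for value in values
}


def lookup(value):
    """Return the column name for a known value, or None if it was never seen."""
    return column_of.get(value.strip().lower())
```

Note that `lookup("444")` returns `None`, because "444" never appears in the table. A plain lookup only covers values it has already seen, which is the gap the question is trying to close.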

The main reason people are saying it is not a machine-learning problem is that your "categorisation" or "prediction" is defined by your column names. Machine learning problems instead (typically) predict some response variable. For example, imagine you had asked this instead:

Name     Power   Money  Bought_item
Jon      Red     30     yes
George   blue    20     no
Tom      Red     40     no
Bob      purple  10     yes

We could build a model to predict Bought_item from the features Name, Power, and Money using an SVM.

Your problem would have to look more like:

Feature1 Feature2 Feature3 Category
1.0      foo      bar      Name
3.1      bar      foo      Name
23.4     abc      def      Money
22.22    afb      dad      Power
223.1    dad      vxv      Money

You then use Feature1, Feature2, and Feature3 to predict Category. At the moment, your question does not give enough information for anyone to really understand what you need or what data you have. You would have to reformulate it this way, or consider an unsupervised approach.
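One way to picture this reformulation is to turn every cell of the original table into its own training row, with the column name as the category. The sketch below does only that reshaping. The three features in `describe` (length, fraction of digits, leading capital) are hypothetical choices made for illustration; nothing in the question says which features are actually appropriate.

```python
# Column-wise data from the question, keyed by column name.
table = {
    "Name": ["Jon", "George", "Tom", "Bob"],
    "Power": ["Red", "blue", "Red", "purple"],
    "Money": ["30", "20", "40", "10"],
}


def describe(value):
    """Hypothetical hand-crafted features for one raw string value."""
    digits = sum(ch.isdigit() for ch in value)
    return {
        "length": len(value),
        "digit_fraction": digits / len(value) if value else 0.0,
        "starts_upper": float(value[:1].isupper()),
    }


# One training row per cell: features of the value, column name as the category.
rows = [
    (describe(value), column)
    for column, values in table.items()
    for value in values
]
```

With rows in this shape, an unseen input such as "444" can be described by the same features and passed to a classifier, instead of failing a lookup.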

Edit:

So frame it this way:

Name     Power   Money   Label
Jon      Red     30      Foo
George   blue    20      Bar
Tom      Red     40      Foo
Bob      purple  10      Bar

One-hot encode Name and Power (OneHotEncoder), so that you have a separate 0/1 variable for each name and each power.

Standardise Money so that it has zero mean and unit variance, which puts most values roughly in the range -1 to 1.

Label-encode your labels (LabelEncoder) so that they become the integers 0, 1, 2, 3, and so on.

Use a one-vs-rest (one-vs-all) classifier: http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html.
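The four steps above could be put together roughly as follows. This is a sketch on the four toy rows from the table, not a tuned model: the linear kernel, the `handle_unknown="ignore"` setting, and the helper name `predict_label` are assumptions made for the example, and Foo/Bar are the placeholder labels from the table.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Toy rows from the table above; Foo/Bar are placeholder labels.
names = ["Jon", "George", "Tom", "Bob"]
powers = ["Red", "blue", "Red", "purple"]
money = [30.0, 20.0, 40.0, 10.0]
labels = ["Foo", "Bar", "Foo", "Bar"]

# Step 1: one-hot encode Name and Power (one 0/1 column per distinct value).
categorical = np.array(list(zip(names, powers)), dtype=object)
one_hot = OneHotEncoder(handle_unknown="ignore")
X_cat = one_hot.fit_transform(categorical).toarray()

# Step 2: standardise Money to zero mean and unit variance.
scaler = StandardScaler()
X_num = scaler.fit_transform(np.array(money).reshape(-1, 1))

# Combine the encoded and scaled columns into one feature matrix.
X = np.hstack([X_cat, X_num])

# Step 3: label-encode the targets as integers.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

# Step 4: one-vs-rest classifier wrapped around an SVM.
clf = OneVsRestClassifier(SVC(kernel="linear"))
clf.fit(X, y)


def predict_label(name, power, amount):
    """Apply the fitted transforms to one new row and return its label string."""
    row_cat = one_hot.transform(np.array([[name, power]], dtype=object)).toarray()
    row_num = scaler.transform([[amount]])
    code = clf.predict(np.hstack([row_cat, row_num]))
    return label_encoder.inverse_transform(code)[0]
```

A call such as `predict_label("Tom", "Red", 35.0)` then returns one of the label strings. With only four rows the prediction itself means little; the point is that new rows go through the same fitted encoders and scaler as the training data.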