machine-learning scikit-learn svm supervised-learning

Using SVM to predict text with label


I have data in a CSV file in the following format:

Name     Power   Money
Jon      Red     30
George   blue    20
Tom      Red     40
Bob      purple  10

I consider values like "Jon", "Red" and "30" as inputs, and each input has a label. For instance, the inputs [Jon, George, Tom, Bob] have the label "Name", and the inputs [Red, blue, purple] have the label "Power". This is basically what my training data looks like: a bunch of values that are each mapped to a label.

Now I want to use an SVM to train a model on this data so that, given a new input, it identifies the correct label. For instance, if the input is "444", the model should be smart enough to categorize it as "Money".

I have installed Python and sklearn, and I have completed the following tutorial as well. I am just not sure how to prepare the input data to train the model.

Also, I am new to machine learning, so if I have said something that sounds wrong or odd, please point it out. I will be happy to learn the correct way.


Solution

  • As your question is currently formulated, you are not dealing with a typical machine learning problem. Currently, you have column-wise data:

    Name     Power   Money
    Jon      Red     30
    George   blue    20
    Tom      Red     40
    Bob      purple  10
    

    If a user now inputs "Jon", you know it is going to be of type "Name" from a simple hash-map lookup, e.g.:

    hashmap["Jon"] -> "Name"
    

    The main reason people are saying it is not a machine-learning problem is that your "categorisation" or "prediction" is defined by your column names. Machine learning problems, instead, typically predict some response variable. For example, imagine you had asked this instead:

    Name     Power   Money  Bought_item
    Jon      Red     30     yes
    George   blue    20     no
    Tom      Red     40     no
    Bob      purple  10     yes
    

    We could build a model that uses an SVM to predict Bought_item from the features Name, Power, and Money.

    Your problem would have to look more like:

    Feature1 Feature2 Feature3 Category
    1.0      foo      bar      Name
    3.1      bar      foo      Name
    23.4     abc      def      Money
    22.22    afb      dad      Power
    223.1    dad      vxv      Money
    

    You then use Feature1, Feature2, and Feature3 to predict Category. At the moment your question does not give enough information for anyone to really understand what you need or what you have. Either reformulate it this way, or consider an unsupervised approach.

    Edit:

    So frame it this way:

    Name     Power   Money   Label
    Jon      Red     30      Foo
    George   blue    20      Bar
    Tom      Red     40      Foo
    Bob      purple  10      Bar
    

    One-hot encode Name and Power, so that each distinct name and each distinct power becomes its own variable that can be 0 or 1.

    Standardise Money so that it is on a comparable scale (values roughly between -1 and 1).

    Label-encode your labels so that they become the integers 0, 1, 2, 3, 4, 5, 6 and so on.

    Use a One vs. All (one-vs-rest) classifier: http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html.
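
The four steps above could be wired together roughly as follows. This is only a minimal sketch, not a definitive implementation: it assumes pandas is available, uses the toy table from this answer (Foo/Bar are placeholder labels), and the linear kernel, the `handle_unknown="ignore"` setting, and the new row "Ann" are illustrative choices that do not come from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Toy table from the answer; with real data this would come from the CSV file.
df = pd.DataFrame({
    "Name": ["Jon", "George", "Tom", "Bob"],
    "Power": ["Red", "blue", "Red", "purple"],
    "Money": [30, 20, 40, 10],
    "Label": ["Foo", "Bar", "Foo", "Bar"],
})
features = df[["Name", "Power", "Money"]]

# Label-encode the target: each distinct label becomes an integer.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df["Label"])

# One-hot encode the categorical columns and standardise the numeric one.
# handle_unknown="ignore" (an assumption) lets unseen names or powers
# encode as all zeros instead of raising an error at prediction time.
preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Name", "Power"]),
    ("numeric", StandardScaler(), ["Money"]),
])

# One-vs-rest wrapper around an SVM; the linear kernel is an arbitrary choice.
model = make_pipeline(preprocess, OneVsRestClassifier(SVC(kernel="linear")))
model.fit(features, y)

# Predict for a hypothetical new row and map the integer back to its label.
new_rows = pd.DataFrame({"Name": ["Ann"], "Power": ["Red"], "Money": [35]})
predicted = label_encoder.inverse_transform(model.predict(new_rows))
print(predicted)
```

Putting the preprocessing inside the pipeline means the same encoding and scaling learned from the training data is reused automatically when predicting, which is easy to get wrong if the steps are applied by hand.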