python tensorflow scikit-learn deep-learning coreml

Is there a way to train an ML model to classify words into categories?


I'm looking to train an ML model to classify words into several broad categories, in my case colors. I'll have some predefined color buckets, for example:

let blue = ["blue", "royal", "dark blue", "light blue"]
let red = ["red", "cardinal", "dusty red", "red polka dot"]

And I want:

a) For the model to classify colors already existing in the buckets, i.e. if given "blue" it will know that "blue" is in the blue bucket.

b) For the model to take words not seen before, such as "faded blue", and to classify them in the correct bucket (in this case blue), based on a confidence score of some sort.

I'm not sure if this is possible. My current method is a series of if statements, but I'm wondering if there is a more intuitive way to do this with an ML model.


Solution

  • You can try scikit-learn:

    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    data = {'blue': ["blue", "royal", "dark blue", "light blue"],
            'red': ["red", "cardinal", "dusty red", "red polka dot"]}

    # one row per bucket; the bucket name ends up in the 'target' column
    # (pd.DataFrame(data) needs every bucket to have the same number of entries)
    train_data = pd.DataFrame(data).T.reset_index()
    train_data.rename(columns={'index': 'target'}, inplace=True)

    # predictors: join each bucket's phrases into one comma-separated string
    X = train_data.drop('target', axis=1)
    X = X.apply(lambda x: ','.join(x), axis=1)

    # target: the bucket name
    y = train_data.target

    # simple toy model: unigram/bigram counts fed into logistic regression
    clf = Pipeline(steps=[
        ('vec', CountVectorizer(ngram_range=(1, 2))),
        ('clf', LogisticRegression())
    ])

    # train a model
    clf.fit(X, y)

    # predict a new value
    print(clf.predict(['faded blue']))
    

    Hope this sets you on the right path :)

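
The pipeline above returns only a label, while part (b) of the question asks for a confidence score. A minimal sketch of one way to get that, assuming the same two buckets: train on one sample per phrase (so buckets can differ in size) and read the class probabilities from `predict_proba`. The names `buckets`, `model` and `classify` are made up for this illustration and are not part of the original answer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Same toy buckets as in the question (hypothetical variable name).
buckets = {
    "blue": ["blue", "royal", "dark blue", "light blue"],
    "red": ["red", "cardinal", "dusty red", "red polka dot"],
}

# One training sample per phrase, labelled with the name of its bucket.
phrases = [p for words in buckets.values() for p in words]
labels = [name for name, words in buckets.items() for _ in words]

model = Pipeline(steps=[
    ("vec", CountVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression()),
])
model.fit(phrases, labels)


def classify(phrase):
    """Return (best_bucket, probability_of_that_bucket) for one phrase."""
    probs = model.predict_proba([phrase])[0]  # one probability per class
    best = probs.argmax()
    return model.classes_[best], float(probs[best])


bucket, confidence = classify("faded blue")
print(bucket, round(confidence, 2))
```

With so little training data the probabilities are only a rough signal. A phrase that shares no token with any bucket is scored from the model's intercept alone, so it is probably worth treating low-confidence results as "unknown" by comparing `confidence` against a threshold you choose.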