I'm looking to train an ML model to classify words into several broad categories, in my case colors. I'll have some predefined color buckets, for example:
let blue = ["blue", "royal", "dark blue", "light blue"]
let red = ["red", "cardinal", "dusty red", "red polka dot"]
And I want:
a) The model to classify colors that already exist in the buckets, i.e. given "blue" it should know that "blue" belongs in the blue bucket.
b) The model to take phrases it has not seen before, such as "faded blue", and classify them into the correct bucket (here, blue) based on a confidence score of some sort.
I'm not sure if this is possible. My current method is a series of if statements, but I'm wondering if there is a more intuitive way to do this with an ML model.
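You can try scikit-learn. The toy example below turns the phrases into unigram and bigram counts with CountVectorizer and fits a logistic regression on top: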
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
data = {'blue': ["blue", "royal", "dark blue", "light blue"],
'red': ["red", "cardinal", "dusty red", "red polka dot"]}
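# one row per bucket; the bucket name ends up in the 'target' column
# (pd.DataFrame(data) needs every bucket list to have the same length)
train_data = pd.DataFrame(data).T.reset_index()
train_data.rename(columns={'index':'target'}, inplace=True)
# predictors: join each bucket's phrases into one comma-separated string
X = train_data.drop('target', axis=1)
X = X.apply(lambda x: ','.join(x), axis=1)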
# target
y = train_data.target
# simple toy model
clf = Pipeline(steps=[
    ('vec', CountVectorizer(ngram_range=(1, 2))),
    ('clf', LogisticRegression())
])
# train a model
clf.fit(X,y)
# predict a new value
print(clf.predict(['faded blue']))
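For the confidence score mentioned in (b), one possible extension is to read the class probabilities with predict_proba instead of calling predict. This is only a minimal sketch under a few assumptions: it trains on one row per phrase rather than one row per bucket (so the bucket lists no longer need equal lengths), and the names `buckets` and `classify` and the 0.6 threshold are illustrative choices, not part of the code above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Buckets from the question; lists may have different lengths here.
buckets = {
    "blue": ["blue", "royal", "dark blue", "light blue"],
    "red": ["red", "cardinal", "dusty red", "red polka dot"],
}

# One training row per phrase, labeled with the name of its bucket.
phrases = [p for names in buckets.values() for p in names]
labels = [bucket for bucket, names in buckets.items() for _ in names]

model = Pipeline(steps=[
    ("vec", CountVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression()),
])
model.fit(phrases, labels)


def classify(phrase, threshold=0.6):
    """Return (bucket, confidence); bucket is None below the threshold.

    The threshold value is an arbitrary illustration and should be tuned.
    """
    probs = model.predict_proba([phrase])[0]
    best = probs.argmax()
    label = model.named_steps["clf"].classes_[best]
    confidence = float(probs[best])
    return (label if confidence >= threshold else None), confidence


print(classify("faded blue"))
print(classify("blue"))
```

With so little training data the probabilities will not be well calibrated, so treat the threshold as something to tune on real examples. A phrase that shares no tokens with any bucket gives the model nothing to go on, so its confidence should come out low, which is when falling back to None (or to your existing if statements) makes sense.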
Hope this sets you on the right path :)