I am still in the early days of learning machine learning (I am a web programmer trying to upskill) and have run into an issue with a dataset provided by Kaggle.
It is a dataset where each sample contains 1..n labels describing the ingredients of a meal, along with a target field for which cuisine the meal is from.
Ingredients {ArrayOf<string>} | Cuisine {string}
[Tomato, Spaghetti, Beef, Basil, Oregano] | Italian
[Coriander Seeds, Cumin, Paprika, Chicken, Garlic, Ginger] | Indian
[Beef, Onion] | French
This data is stylized to illustrate how the data is structured, with the ingredients being my input and the cuisine being my target output.
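For reference, a minimal sketch of the stylized data above as a pandas DataFrame (the real data comes from Kaggle's CSV):

import pandas as pd

# Each row is one meal: a variable-length list of ingredient labels
# plus the cuisine it belongs to.
df = pd.DataFrame({
    'Ingredients': [
        ['Tomato', 'Spaghetti', 'Beef', 'Basil', 'Oregano'],
        ['Coriander Seeds', 'Cumin', 'Paprika', 'Chicken', 'Garlic', 'Ginger'],
        ['Beef', 'Onion'],
    ],
    'Cuisine': ['Italian', 'Indian', 'French'],
})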
What I want to know is whether I have the right theory behind my approach: one-hot encoding each unique ingredient as its own binary feature. While this may work at the minute, it might not be scalable, since I currently have 10,000 unique ingredients and expect tens of thousands more in the future.
Am I on the right track with my thinking, and should I make any allowances for the expansion of features in the future? And are there any built-in features that support what I am trying to do?
Use:

from sklearn.feature_extraction.text import CountVectorizer

# Collect every unique ingredient across all rows.
vocab = set(j for i in df['Ingredients'] for j in i)

# analyzer=lambda x: x treats each row's list as already-tokenized input;
# scikit-learn sorts a set vocabulary, so the columns come out alphabetical.
cv = CountVectorizer(vocabulary=vocab, analyzer=lambda x: x)
X = cv.fit_transform(df['Ingredients'])
If you load the Ingredients {ArrayOf<string>} column as text, you first have to convert it to a list:

# Strip the brackets, split on commas, and trim whitespace around each label.
df['Ingredients'] = df['Ingredients {ArrayOf<string>}'].apply(
    lambda x: [i.strip() for i in x.replace('[', '').replace(']', '').split(',')]
)
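For example, one raw cell parses like this (a quick sanity check, using the stylized format from the question):

raw = '[Tomato, Spaghetti, Beef, Basil, Oregano]'
[i.strip() for i in raw.replace('[', '').replace(']', '').split(',')]
# ['Tomato', 'Spaghetti', 'Beef', 'Basil', 'Oregano']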
The output X will be your input matrix:

X.todense()
matrix([[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
        [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]], dtype=int64)
For the vocabulary (i.e. the column order of X):

cv.get_feature_names()  # cv.get_feature_names_out() on scikit-learn >= 1.0
['Basil',
 'Beef',
 'Chicken',
 'Coriander Seeds',
 'Cumin',
 'Garlic',
 'Ginger',
 'Onion',
 'Oregano',
 'Paprika',
 'Spaghetti',
 'Tomato']
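On the scalability side: X is already a scipy sparse matrix, so 10,000+ ingredient columns are cheap to store, and most scikit-learn estimators accept sparse input directly. If the vocabulary keeps growing and you don't want to rebuild it, scikit-learn's built-in HashingVectorizer maps tokens to a fixed number of columns via the hashing trick, so unseen ingredients never change the input shape (the trade-off is that you lose the feature names). A minimal sketch, assuming the same list-valued Ingredients column and the Cuisine target:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Fixed-width output: new ingredients hash into the same 2**18 columns,
# so the model's input shape never changes as the vocabulary grows.
hv = HashingVectorizer(analyzer=lambda x: x, n_features=2**18,
                       alternate_sign=False)
X = hv.transform(df['Ingredients'])   # sparse matrix, shape (n_meals, 2**18)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, df['Cuisine'])
clf.predict(hv.transform([['Beef', 'Onion']]))   # e.g. array(['French'], ...)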