Tags: python, pandas, machine-learning, categorical-data, conceptual

Encoding categorical data from n-length arrays of varying categories in Python


I am still in the early days of understanding machine learning (I am a web programmer trying to upskill) and have run into an issue with a dataset provided by Kaggle.

It is a dataset where each sample contains 1..n labels describing the ingredients of a meal, and the target field is the cuisine the meal is from.

Ingredients {ArrayOf<string>} | Cuisine {string}
[Tomato, Spaghetti, Beef, Basil, Oregano] | Italian
[Coriander Seeds, Cumin, Paprika, Chicken, Garlic, Ginger] | Indian
[Beef, Onion] | French

This data is stylized to illustrate how the data is structured, with the ingredients being my input and the cuisine being my target output.
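
For context, something like the following is how I imagine loading this stylized data into pandas (just a sketch for illustration; the real Kaggle data is much larger):

    import pandas as pd

    # Stylized stand-in for the Kaggle data: one row per meal,
    # Ingredients is a list of strings, Cuisine is the target label.
    df = pd.DataFrame({
        'Ingredients': [
            ['Tomato', 'Spaghetti', 'Beef', 'Basil', 'Oregano'],
            ['Coriander Seeds', 'Cumin', 'Paprika', 'Chicken', 'Garlic', 'Ginger'],
            ['Beef', 'Onion'],
        ],
        'Cuisine': ['Italian', 'Indian', 'French'],
    })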

What I want to know is whether I have the right theory behind my approach of (a rough code sketch follows this list):

  • iterating through each row of the dataframe in pre-processing
  • collecting every ingredient and adding it to a Set
  • for each ingredient, adding a new column to the dataframe named after the ingredient
  • iterating through each row and, based on its ingredients, setting the corresponding columns to 1 or 0 (i.e. setting the "Beef" column to 1)
  • training the model on the transformed dataset
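
In code, the approach I have in mind would look roughly like this (a sketch only, building on the DataFrame above; the new column names are taken straight from the ingredient strings):

    # Collect every distinct ingredient across all rows.
    all_ingredients = set(ing for row in df['Ingredients'] for ing in row)

    # Add one 0/1 column per ingredient, set from each row's ingredient list.
    for ingredient in sorted(all_ingredients):
        df[ingredient] = df['Ingredients'].apply(lambda row: int(ingredient in row))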

While this may work for the minute, it might not be scalable, since I currently have 10,000 unique ingredients and expect to encounter tens of thousands more in the future.

Am I on the right track with my thinking, and should I account for the expansion of features in the future? Also, are there any inbuilt features that support what I am trying to do?


Solution

  • Use:

    from sklearn.feature_extraction.text import CountVectorizer

    # Collect every distinct ingredient across all rows to use as the vocabulary.
    vocab = set(j for i in df['Ingredients'] for j in i)

    # analyzer=lambda x: x treats each row's list as already-tokenised input,
    # so every ingredient string becomes its own feature column.
    cv = CountVectorizer(vocabulary=vocab, analyzer=lambda x: x)

    X = cv.fit_transform(df['Ingredients'])


    If you load the Ingredients {ArrayOf<string>} column as text, you first have to convert it back to a list -

    # Strip the brackets and split on commas to recover a list of ingredient strings per row.
    df['Ingredients'] = df['Ingredients {ArrayOf<string>} '].apply(
        lambda x: [i.strip() for i in x.replace('[', '').replace(']', '').split(',')]
    )
    

    Output

    X will be your input matrix -

    X.todense()
    
    matrix([[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
            [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0],
            [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]], dtype=int64)
    

    For the vocabulary (note that in recent scikit-learn versions this method is get_feature_names_out()) -

    cv.get_feature_names()
    
    ['Basil',
     'Beef',
     'Chicken',
     'Coriander Seeds',
     'Cumin',
     'Garlic',
     'Ginger',
     'Onion',
     'Oregano',
     'Paprika',
     'Spaghetti',
     'Tomato']
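
    As an alternative inbuilt option, sklearn's MultiLabelBinarizer does the same multi-hot encoding directly from the ingredient lists; a minimal sketch, where sparse_output=True keeps memory usage reasonable as the vocabulary grows into the tens of thousands -

    from sklearn.preprocessing import MultiLabelBinarizer

    # Sparse output avoids building a dense n_meals x n_ingredients matrix.
    mlb = MultiLabelBinarizer(sparse_output=True)
    X = mlb.fit_transform(df['Ingredients'])

    # mlb.classes_ holds the learned ingredient vocabulary, in sorted order.

    This should give an equivalent 0/1 matrix to the CountVectorizer approach above (assuming no ingredient repeats within a row).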