Tags: python, pandas, machine-learning, categorical-data, conceptual

Encoding categorical data from n-length arrays of varying categories in Python


I am still in the early days of understanding machine learning (I am a web programmer trying to upskill) and have run into an issue with a dataset provided by Kaggle.

It is a dataset where each sample contains 1..n labels describing the ingredients of a meal, and the target field is the cuisine the meal is from.

Ingredients {ArrayOf<string>} | Cuisine {string}
[Tomato, Spaghetti, Beef, Basil, Oregano] | Italian
[Coriander Seeds, Cumin, Paprika, Chicken, Garlic, Ginger] | Indian
[Beef, Onion] | French

This data is stylized to illustrate how the data is structured, with the ingredients being my input and the cuisine being my target output.
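
For context, something like the following is how I imagine loading this stylized data into pandas (just a sketch for illustration; the real Kaggle data is much larger):

    import pandas as pd

    # Stylized stand-in for the Kaggle data: one row per meal,
    # Ingredients is a list of strings, Cuisine is the target label.
    df = pd.DataFrame({
        'Ingredients': [
            ['Tomato', 'Spaghetti', 'Beef', 'Basil', 'Oregano'],
            ['Coriander Seeds', 'Cumin', 'Paprika', 'Chicken', 'Garlic', 'Ginger'],
            ['Beef', 'Onion'],
        ],
        'Cuisine': ['Italian', 'Indian', 'French'],
    })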

What I want to know is whether I have the right theory behind my approach of (a rough code sketch follows this list):

  • iterating through each row of the dataframe in pre-processing
  • collecting every ingredient and adding it to a Set
  • for each ingredient, adding a new column to the dataframe named after the ingredient
  • iterating through each row and, based on its ingredients, setting the corresponding columns to 1 or 0 (i.e. setting the "Beef" column to 1)
  • training the model on the transformed dataset
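
In code, the approach I have in mind would look roughly like this (a sketch only, building on the DataFrame above; the new column names are taken straight from the ingredient strings):

    # Collect every distinct ingredient across all rows.
    all_ingredients = set(ing for row in df['Ingredients'] for ing in row)

    # Add one 0/1 column per ingredient, set from each row's ingredient list.
    for ingredient in sorted(all_ingredients):
        df[ingredient] = df['Ingredients'].apply(lambda row: int(ingredient in row))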

While this may work for the minute, it might not be scalable, since I currently have 10,000 unique ingredients and expect to encounter tens of thousands more in the future.

Am I on the right track with my thinking, and should I account for the expansion of features in the future? Also, are there any inbuilt features that support what I am trying to do?


Solution

  • Use:

    from sklearn.feature_extraction.text import CountVectorizer

    # Collect every distinct ingredient across all rows to use as the vocabulary.
    vocab = set(j for i in df['Ingredients'] for j in i)

    # analyzer=lambda x: x treats each row's list as already-tokenised input,
    # so every ingredient string becomes its own feature column.
    cv = CountVectorizer(vocabulary=vocab, analyzer=lambda x: x)

    X = cv.fit_transform(df['Ingredients'])


    If you load the Ingredients {ArrayOf<string>} column as text, you first have to convert it back to a list -

    # Strip the brackets and split on commas to recover a list of ingredient strings per row.
    df['Ingredients'] = df['Ingredients {ArrayOf<string>} '].apply(
        lambda x: [i.strip() for i in x.replace('[', '').replace(']', '').split(',')]
    )
    

    Output

    X will be your input matrix -

    X.todense()
    
    matrix([[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
            [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0],
            [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]], dtype=int64)
    

    For the vocabulary (note that in recent scikit-learn versions this method is get_feature_names_out()) -

    cv.get_feature_names()
    
    ['Basil',
     'Beef',
     'Chicken',
     'Coriander Seeds',
     'Cumin',
     'Garlic',
     'Ginger',
     'Onion',
     'Oregano',
     'Paprika',
     'Spaghetti',
     'Tomato']
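
    As an alternative inbuilt option, sklearn's MultiLabelBinarizer does the same multi-hot encoding directly from the ingredient lists; a minimal sketch, where sparse_output=True keeps memory usage reasonable as the vocabulary grows into the tens of thousands -

    from sklearn.preprocessing import MultiLabelBinarizer

    # Sparse output avoids building a dense n_meals x n_ingredients matrix.
    mlb = MultiLabelBinarizer(sparse_output=True)
    X = mlb.fit_transform(df['Ingredients'])

    # mlb.classes_ holds the learned ingredient vocabulary, in sorted order.

    This should give an equivalent 0/1 matrix to the CountVectorizer approach above (assuming no ingredient repeats within a row).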