python, pandas, dummy-variable

Pandas create dummy features for each string in a dictionary of lists


I am implementing the following logic for feature engineering. A simple approach is easy, but I am wondering whether there is a more efficient solution. Ideas are appreciated even if you don't feel like implementing the whole code!

Take this DataFrame and dictionary:

import pandas as pd
random_animals = pd.DataFrame(
                {'description':['xdogx','xcatx','xhamsterx','xdogx'
                                ,'xhorsex','xdonkeyx','xcatx']
                })


cat_dict = {'category_a':['dog','cat']
            ,'category_b':['horse','donkey']}

We want to create a column/feature for each string in the dictionary AND for each category: 1 if the string (or, for a category, any of its strings) is contained in the description column, 0 otherwise.

So the output for this toy example would look like:

  description  is_dog is_cat is_horse is_donkey is_category_a is_category_b
0       xdogx       1      0        0         0             1             0
1       xcatx       0      1        0         0             1             0    
2   xhamsterx       0      0        0         0             0             0
3       xdogx       1      0        0         0             1             0
4     xhorsex       0      0        1         0             0             1
5    xdonkeyx       0      0        0         1             0             1
6       xcatx       0      1        0         0             1             0

A simple approach would be to iterate once for each required output column and run the following (is_dog is hardcoded here for simplicity):

random_animals['is_dog'] = random_animals['description'].str.contains('dog')*1

There can be an arbitrary number of strings and categories in cat_dict, so I am wondering whether there is a better way to do this.
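For reference, the simple approach described above could be generalized to every string and category roughly as follows. This is only a baseline sketch: the name baseline and the choice of regex=False (so each string is matched as a literal substring rather than as a regular expression, which is the default for str.contains) are assumptions, not part of the original question.

```python
import pandas as pd

random_animals = pd.DataFrame(
    {'description': ['xdogx', 'xcatx', 'xhamsterx', 'xdogx',
                     'xhorsex', 'xdonkeyx', 'xcatx']})

cat_dict = {'category_a': ['dog', 'cat'],
            'category_b': ['horse', 'donkey']}

# work on a copy so the original frame is left untouched
baseline = random_animals.copy()

for category, strings in cat_dict.items():
    for s in strings:
        # one pass over the description column per string
        baseline['is_' + s] = (
            baseline['description'].str.contains(s, regex=False).astype(int))
    # a row belongs to the category if any of its strings matched
    baseline['is_' + category] = (
        baseline[['is_' + s for s in strings]].max(axis=1))
```

Note that the columns come out grouped by category (is_dog, is_cat, is_category_a, ...) rather than in the order shown in the desired output.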


Solution

  • Here is a vectorized method. The main observation is that random_animals.description.str.contains, when applied to a single string, returns a Series of indicators, one for each row of random_animals.

    Since random_animals.description.str.contains is itself vectorized over the rows, we can apply it to each string in the collection of animals (one call per string) and stack the results into a full indicator matrix, with one row per description and one column per animal.

    Finally, we can add the category columns by combining the animal columns that already exist: a row belongs to a category if any of that category's animal columns is 1. This will likely be faster than checking for string inclusion a second time.

    import pandas as pd
    random_animals = pd.DataFrame(
                    {'description':['xdogx','xcatx','xhamsterx','xdogx'
                                    ,'xhorsex','xdonkeyx','xcatx']
                    })
    
    
    cat_dict = {'category_a':['dog', 'cat']
                ,'category_b':['horse', 'donkey']}
    
    # create a Series containing all individual animals; drop_duplicates
    # guards against an animal that is listed under more than one category
    animals = pd.Series([animal for v in cat_dict.values()
            for animal in v]).drop_duplicates()
    
    df = pd.DataFrame(
            animals.apply(random_animals.description.str.contains).T.values,
            index  = random_animals.description,
            columns = animals).astype(int)
    
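    # the loop variable is called members so that it does not shadow
    # the animals Series defined earlier
    for cat, members in cat_dict.items():
        df[cat] = df[members].any(axis=1).astype(int)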
    
    #              dog  cat  horse  donkey  category_a  category_b
    # description
    # xdogx          1    0      0       0           1           0
    # xcatx          0    1      0       0           1           0
    # xhamsterx      0    0      0       0           0           0
    # xdogx          1    0      0       0           1           0
    # xhorsex        0    0      1       0           0           1
    # xdonkeyx       0    0      0       1           0           1
    # xcatx          0    1      0       0           1           0
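    The result above uses description as the index and unprefixed column names. If the exact layout from the question is wanted (description kept as a regular column, every indicator prefixed with is_), one possible variation is sketched below. The names strings, indicators and result are hypothetical, and regex=False is an assumption that the dictionary entries should be matched as literal substrings rather than as regular expressions.

```python
import pandas as pd

random_animals = pd.DataFrame(
    {'description': ['xdogx', 'xcatx', 'xhamsterx', 'xdogx',
                     'xhorsex', 'xdonkeyx', 'xcatx']})

cat_dict = {'category_a': ['dog', 'cat'],
            'category_b': ['horse', 'donkey']}

# unique strings in first-seen order (dict.fromkeys drops repeats)
strings = list(dict.fromkeys(
    s for members in cat_dict.values() for s in members))

desc = random_animals['description']

# one str.contains call per string, collected into an indicator matrix
indicators = pd.DataFrame(
    {s: desc.str.contains(s, regex=False) for s in strings},
    index=random_animals.index).astype(int)

# category columns are derived from the existing indicator columns,
# so the description strings are not scanned a second time
for category, members in cat_dict.items():
    indicators[category] = indicators[members].any(axis=1).astype(int)

# prefix every indicator and attach it next to the original column
result = random_animals.join(indicators.add_prefix('is_'))
```

    Here result has the columns description, is_dog, is_cat, is_horse, is_donkey, is_category_a and is_category_b, in that order.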