Tags: python, python-3.x, pandas, dataframe, dummy-variable

Pandas: find text in a row and assign a dummy variable value based on it


I have a data frame that contains a text column, df["input"].

I would like to create one dummy variable per word list. Each dummy should be 1 if df["input"] contains any of the words in that list, but only if none of the previous lists has already matched that row. The logic is: 1) create a dummy variable equal to zero; 2) set it to one if the text contains any word in the given list and no earlier list matched.

import pandas as pd

# Example lists
listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle", "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]

df = pd.DataFrame({'input': ['amazon listing subtitle', 
                             'medical', 
                             'film biotechnology dentist']})

which looks like:

input
amazon listing subtitle
medical 
film biotechnology dentist

The final dataset should look like:

input                           listings  scripting  medical
amazon listing subtitle            1         0         0
medical                            0         0         1          
film biotechnology dentist         0         1         0

Solution

  • One possible implementation is to use str.contains in a loop to create the three boolean columns, then use NumPy's argmax to find the position of the first matching list in each row, and finally build the dummy variables from those positions:

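    import numpy as np

    d = {'listings': listings, 'scripting': scripting, 'medical': medical}
    # One boolean column per list: True where the text contains any of its words
    for k, v in d.items():
        df[k] = df['input'].str.contains('|'.join(v))

    arr = df[list(d)].to_numpy()
    tmp = np.zeros(arr.shape, dtype='int8')
    # argmax gives the first True in each row; max is False for rows with no match
    tmp[np.arange(len(arr)), arr.argmax(axis=1)] = arr.max(axis=1)
    # combine_first keeps the new dummies and takes the 'input' column from df
    out = pd.DataFrame(tmp, columns=list(d)).combine_first(df)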
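
The argmax/max line above is the least obvious step, so here is a minimal sketch of it on a toy boolean array. The array values and the names first_hit, any_hit and one_hot are made up for illustration and are not part of the original answer.

```python
import numpy as np

# Toy match matrix: one row per text, one column per word list (in priority order).
arr = np.array([
    [True,  True,  False],   # matches lists 0 and 1 -> only list 0 should be kept
    [False, False, True],    # matches list 2 only
    [False, False, False],   # matches nothing -> the row should stay all zeros
])

first_hit = arr.argmax(axis=1)   # position of the first True per row (0 if there is none)
any_hit = arr.max(axis=1)        # True if the row matched at least one list

one_hot = np.zeros(arr.shape, dtype="int8")
# Write 1 at the first match; rows without a match write 0 at position 0.
one_hot[np.arange(len(arr)), first_hit] = any_hit
print(one_hot)
```

Assigning any_hit rather than a constant 1 is what keeps rows with no match at zero, since argmax returns 0 for an all-False row.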
    

    In this case, though, a nested for-loop may be more efficient, because it can stop checking a row at the first matching list:

    import re
    def get_dummy_vars(col, lsts):
        out = []
        len_lsts = len(lsts)
        for row in col:
            tmp = []
            # Check the lists in priority order; any() tells us whether the current list has a match.
            # On the first match, break and pad with 0s, since any later match doesn't matter.
            for lst in lsts:
                tmp.append(int(any(re.search(fr"\b{re.escape(x)}\b", row) for x in lst)))
                if tmp[-1]:
                    break
            tmp += [0] * (len_lsts - len(tmp))
            out.append(tmp)
        return out
    
    lsts = [listings, scripting, medical]
    out = df.join(pd.DataFrame(get_dummy_vars(df['input'], lsts), columns=['listings', 'scripting', 'medical']))
    

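    Output (note that the columns appear here as listings, medical, scripting, so the order differs from the expected table in the question):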

                            input listings medical scripting
    0     amazon listing subtitle        1       0         0
    1                     medical        0       1         0
    2  film biotechnology dentist        0       0         1
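
As a possible middle ground between the two approaches, the sketch below keeps the word-boundary matching of the loop version but does the "first list wins" step with NumPy. It is only an illustration under stated assumptions: build_pattern, groups, seen_before and dummies are hypothetical names that do not appear in the original answer, and the data is the example from the question.

```python
import re

import numpy as np
import pandas as pd

listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle", "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]
df = pd.DataFrame({"input": ["amazon listing subtitle",
                             "medical",
                             "film biotechnology dentist"]})

# Dict order defines the priority of the lists.
groups = {"listings": listings, "scripting": scripting, "medical": medical}


def build_pattern(words):
    """Hypothetical helper: one regex matching any of the words as whole words."""
    escaped = "|".join(re.escape(w) for w in words)
    return rf"\b(?:{escaped})\b"


# Boolean matrix with one column per list.
matches = np.column_stack([
    df["input"].str.contains(build_pattern(words), regex=True).to_numpy(dtype=bool)
    for words in groups.values()
])

# seen_before[i, j] is True if any list before j already matched row i.
seen_before = np.zeros_like(matches)
seen_before[:, 1:] = np.logical_or.accumulate(matches, axis=1)[:, :-1]

# Keep a match only if no earlier list matched the same row.
dummies = (matches & ~seen_before).astype("int8")
out = df.join(pd.DataFrame(dummies, columns=list(groups), index=df.index))
print(out)
```

Because the dummies are joined onto df, the columns keep the order of groups (listings, scripting, medical), matching the table in the question.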