How to create new columns based on phrase existence?


I want to create new columns based on phrase existence

This is my data

No   Body
1    Office software is already paid
2    Excel software is not paid yet
3    Power point software is already paid

I want to categorize the rows by whether certain phrases are present. This is my code:

countries1 = df.Body.str.extract('(software|is already paid)', expand = False)
dummies1 = pd.get_dummies(countries1)
df_1 = pd.concat([df,dummies1],axis = 1)

The result is

No   Body                                   software   is already paid    
1    Office software is already paid        0          1
2    Excel software is not paid yet         1          0
3    Power point software is already paid   0          1

What I expected is

No   Body                                   software   is already paid    
1    Office software is already paid        1          1
2    Excel software is not paid yet         1          0
3    Power point software is already paid   1          1

What's wrong with my code? Or maybe I'm not using the right function?


Solution

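  • str.extract returns only the first match in each row, so each row can set at most one dummy column. Let's try using extractall, which captures every match, and then sum the dummies per original row: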

    # sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
    df.assign(**df.Body.str.extractall('(software|is already paid)')[0]
                  .str.get_dummies().groupby(level=0).sum())
    

    Output:

       No                                  Body  is already paid  software
    0   1       Office software is already paid                1         1
    1   2        Excel software is not paid yet                0         1
    2   3  Power point software is already paid                1         1
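
An alternative that may be easier to read is to test each phrase separately with str.contains. The following is a minimal sketch, not part of the original answer: the `phrases` list and the `flags` and `result` names are made up for illustration, and the frame is rebuilt from the sample data in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "No": [1, 2, 3],
    "Body": [
        "Office software is already paid",
        "Excel software is not paid yet",
        "Power point software is already paid",
    ],
})

# Hypothetical list of phrases to flag; one new column is created per phrase.
phrases = ["software", "is already paid"]

# regex=False treats each phrase as literal text rather than a pattern.
# astype(int) turns the True/False result into 1/0.
flags = {p: df["Body"].str.contains(p, regex=False).astype(int) for p in phrases}

result = df.assign(**flags)
print(result)
```

Unlike the extractall approach, this keeps the columns in the order of the phrase list, and a row that contains none of the phrases gets 0 in every column instead of a missing value.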