Tags: python, pandas, nlp, tags, keyword

Python: Tag keywords and create new columns of tags with 1's and 0's


I have the code below, which iterates through the sentences in a column, tags the keywords found in each sentence, and creates new columns of 1's and 0's named after the tags. If a sentence contains a keyword, the column for that keyword's tag gets a 1. If it contains no keyword for that tag but does contain a keyword for another tag, the column gets a 0. If the sentence has no keywords at all, the entire row is removed.

The code below partly works, but it still misses some keywords, it matches partial words, and it outputs 1's and 0's for blank cells (rows with no sentence). I'm not sure what is missing. How do I ensure that it does not miss keywords and does not tag partial words or blank sentences?

pattern = '|'.join(dict_list)
tags_id = (df['description_summary']
   .str.extractall(f'({pattern})')[0]
   .map(keyword_dict)
   .reset_index(name='col')
   .assign(value=1)
   .pivot_table(index=[df['issue.id'], df['description_summary']], columns='col', values='value', fill_value=0))

Here is a sample of the data I'm working with, which comes from an Excel file:

    issue.id  description_summary

0   753       Long sentence with keywords ball and hot
1   937       Long sentence with keywords cold, stick, and glove
2   
3   598       Long sentence with NO keywords
4   574       Long sentence with keywords very cold and cold 

Here is the current (wrong) output:

    issue.id  description_summary                                     Toy     Temperature 

0    753       Long sentence with keywords ball and hot                1       1
1    937       Long sentence with keywords cold, stick, and glove      1       1
2                                                                      1       0
3    598       Long sentence with NO keywords but outputs 1s and 0s    0       1
4    574       Long sentence with keywords very cold and cold          1       1

Here is the output I want:

    issue.id  description_summary                                     Toy     Temperature    

0    753       Long sentence with keywords ball and hot                1       1
1    937       Long sentence with keywords cold, stick, and glove      1       1
4    574       Long sentence with keywords very cold and cold          0       1

Here is the dictionary of keywords and tags ('keywords': 'tags'):

dict_list = {'Hot': 'Temperature',
 'Cold': 'Temperature',
 'Very cold': 'Temperature',
 'Ball': 'Toy',
 'Glove': 'Toy',
 'Stick': 'Toy'
 }



Solution

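  • I think your first issue is with map. The matched strings are lowercase, but the keys of dict_list are capitalised, so the lookup finds nothing and every match maps to NaN. Reconstructing roughly what you are doing up to that point: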

    >>> import re
    >>> pattern = '|'.join(dict_list.keys())
    >>> matches = df['description_summary'].str.extractall(f"({pattern})", flags=re.IGNORECASE)[0]
    >>> matches
       match
    0  0             ball
       1              hot
    1  0             cold
       1            stick
       2            glove
    4  0        very cold
       1             cold
    Name: 0, dtype: object
    >>> matches.map(dict_list)
       match
    0  0        NaN
       1        NaN
    1  0        NaN
       1        NaN
       2        NaN
    4  0        NaN
       1        NaN
    Name: 0, dtype: object
    

    However, lowercasing both the matches and the dictionary keys before mapping gives a better result:

    >>> matches.str.lower().map({kw.lower():tag for kw, tag in dict_list.items()})
       match
    0  0                Toy
       1        Temperature
    1  0        Temperature
       1                Toy
       2                Toy
    4  0        Temperature
       1        Temperature
    Name: 0, dtype: object
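
The NaN values can be reproduced in isolation, without any regex. The following is a minimal sketch with made-up names (keyword_to_tag, found, lowered are illustrative, not from the question). It shows that Series.map with a dict is an exact, case-sensitive key lookup, and that lowercasing both sides makes the lookup succeed:

```python
import pandas as pd

# Capitalised keys, as in the question's dictionary (shortened for illustration)
keyword_to_tag = {"Hot": "Temperature", "Ball": "Toy"}

# Lowercase strings, as they appear in the sentences
found = pd.Series(["ball", "hot"])

# Exact-key lookup: "ball" != "Ball", so every value is missing
unmapped = found.map(keyword_to_tag)

# Normalise both the keys and the values to lowercase before the lookup
lowered = {kw.lower(): tag for kw, tag in keyword_to_tag.items()}
mapped = found.str.lower().map(lowered)

print(unmapped.isna().all())  # True
print(mapped.tolist())        # ['Toy', 'Temperature']
```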
    

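    The second issue seems to be pivot_table, which assigns matches to the wrong rows: matches has one row per match, while df['issue.id'] and df['description_summary'] have one row per sentence, so the two do not line up. We can instead pivot on the first level of the index (the original row number), and then use that to join with df: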

    >>> tags = matches.str.lower().map({kw.lower():tag for kw, tag in dict_list.items()})
    >>> tags = tags.rename_axis(['line', 'match']).reset_index(name='tag').assign(value=1)
    >>> tags.pivot_table(index='line', columns='tag', values='value', fill_value=0).join(df[['issue.id', 'description_summary']])
          Temperature  Toy  issue.id                                description_summary
    line                                                                               
    0               1    1     753.0           Long sentence with keywords ball and hot
    1               1    1     937.0  Long sentence with keywords cold, stick, and g...
    4               1    0     574.0     Long sentence with keywords very cold and cold
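
The steps above do not yet address partial-word matches. One possible way to handle them, sketched here as an assumption rather than as part of the original answer, is to wrap the alternation in `\b` word boundaries and escape each keyword with re.escape. The sample df is reconstructed from the question, with the blank row represented as None. The names lower_map, flags and result are illustrative, and get_dummies followed by groupby is used here in place of pivot_table:

```python
import re
import pandas as pd

# Sample data reconstructed from the question; the blank row is assumed to be None/NaN
df = pd.DataFrame({
    "issue.id": [753, 937, None, 598, 574],
    "description_summary": [
        "Long sentence with keywords ball and hot",
        "Long sentence with keywords cold, stick, and glove",
        None,
        "Long sentence with NO keywords",
        "Long sentence with keywords very cold and cold",
    ],
})

dict_list = {"Hot": "Temperature", "Cold": "Temperature", "Very cold": "Temperature",
             "Ball": "Toy", "Glove": "Toy", "Stick": "Toy"}

# Lowercase keys so the lookup agrees with the lowercased matches
lower_map = {kw.lower(): tag for kw, tag in dict_list.items()}

# Longest keywords first so "very cold" is tried before "cold";
# \b restricts matches to whole words, re.escape guards against regex metacharacters
keywords = sorted(lower_map, key=len, reverse=True)
pattern = r"\b(" + "|".join(re.escape(kw) for kw in keywords) + r")\b"

# One row per match, indexed by (original row, match number); non-string cells are skipped
matches = df["description_summary"].str.extractall(pattern, flags=re.IGNORECASE)[0]
tags = matches.str.lower().map(lower_map)

# One indicator column per tag, collapsed back to one row per original line
flags = pd.get_dummies(tags).groupby(level=0).max().astype(int)

# The inner join keeps only rows that had at least one keyword
result = df.join(flags, how="inner")
print(result)
```

With this sample, only rows 0, 1 and 4 remain, and a word such as "hotel" is not tagged as "hot" because of the word boundaries.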