I have the below code to iterate through sentences of a column, tag keywords in sentences, and create new columns of those tags consisting of 1's and 0's. If a keyword exists, it is automatically tagged and given a 1 in a new column named after the tag. If it does not exist but another keyword exists, it is given a 0. If the sentence does not have any keywords whatsoever, the entire row will be removed.
The code below mostly works, but it still misses some keywords, and it tags partial words and blank cells (rows with no sentence) with 1's and 0's. What am I missing? How do I ensure that it does not miss keywords and does not tag partial words and blank sentences?
pattern = '|'.join(dict_list)
tags_id = (df['description_summary']
.str.extractall(f'({pattern})')[0]
.map(keyword_dict)
.reset_index(name='col')
.assign(value=1)
.pivot_table(index=[df['issue.id'], df['description_summary']], columns='col', values='value', fill_value=0))
Here is basically the data I'm working with in an excel file:
issue.id description_summary
0 753 Long sentence with keywords ball and hot
1 937 Long sentence with keywords cold, stick, and glove
2
3 598 Long sentence with NO keywords
4 574 Long sentence with keywords very cold and cold
Here is the current (wrong) output:
issue.id description_summary Toy Temperature
0 753 Long sentence with keywords ball and hot 1 1
1 937 Long sentence with keywords cold, stick, and glove 1 1
2 1 0
3 598 Long sentence with NO keywords but outputs 1s and 0s 0 1
4 574 Long sentence with keywords very cold and cold 1 1
Here is the output I want:
issue.id description_summary Toy Temperature
0 753 Long sentence with keywords ball and hot 1 1
1 937 Long sentence with keywords cold, stick, and glove 1 1
4 574 Long sentence with keywords very cold and cold 0 1
Here is the dictionary of keywords and tags ('keywords': 'tags'):
dict_list = {'Hot': 'Temperature',
'Cold': 'Temperature',
'Very cold': 'Temperature',
'Ball': 'Toy',
'Glove': 'Toy',
'Stick': 'Toy'
}
How do I ensure that it does not miss keywords and does not tag partial words and blank sentences?
I think your first issue is with map. If I reconstitute roughly what you're doing up to that point:
>>> import re
>>> pattern = '|'.join(dict_list.keys())
>>> matches = df['description_summary'].str.extractall(f"({pattern})", flags=re.IGNORECASE)[0]
>>> matches
match
0 0 ball
1 hot
1 0 cold
1 stick
2 glove
4 0 very cold
1 cold
Name: 0, dtype: object
>>> matches.map(dict_list)
match
0 0 NaN
1 NaN
1 0 NaN
1 NaN
2 NaN
4 0 NaN
1 NaN
Name: 0, dtype: object
The matches are lowercase while the dictionary keys are capitalized, so the lookup finds nothing. Making the lookup case-insensitive gives a better result:
>>> matches.str.lower().map({kw.lower():tag for kw, tag in dict_list.items()})
match
0 0 Toy
1 Temperature
1 0 Temperature
1 Toy
2 Toy
4 0 Temperature
1 Temperature
Name: 0, dtype: object
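The question also asks about partial words, which the case fix alone does not address. Here is a minimal sketch, assuming keywords should only match whole words: wrap the alternation in \b anchors, escape each keyword, and list longer keywords first so that "Very cold" is tried before "Cold". The sample series is made up for illustration and is not the asker's data.

```python
import re
import pandas as pd

# Keyword -> tag mapping from the question.
dict_list = {'Hot': 'Temperature', 'Cold': 'Temperature', 'Very cold': 'Temperature',
             'Ball': 'Toy', 'Glove': 'Toy', 'Stick': 'Toy'}

# Longest keywords first so multi-word keywords win over their substrings.
keywords = sorted(dict_list, key=len, reverse=True)
# \b anchors prevent matches inside longer words such as 'balloon' or 'shot'.
pattern = r'\b(' + '|'.join(re.escape(kw) for kw in keywords) + r')\b'

# Hypothetical sample: one normal row, one partial-word trap, one blank cell.
sample = pd.Series(['A ball and a hot day', 'balloon shot', None, 'Very cold and cold'])

matches = sample.str.extractall(pattern, flags=re.IGNORECASE)[0]

# Case-insensitive lookup, as in the answer above.
lookup = {kw.lower(): tag for kw, tag in dict_list.items()}
tags = matches.str.lower().map(lookup)
print(tags)
```

Rows 1 and 2 of the sample produce no matches at all, so they never reach the tagging step.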
The second issue seems to be pivot_table, which assigns the wrong lines to matches because df and matches do not have the same shape. We can instead pivot using the first level of the index, and then use that to join with df:
>>> tags = matches.str.lower().map({kw.lower():tag for kw, tag in dict_list.items()})
>>> tags = tags.rename_axis(['line', 'match']).reset_index(name='tag').assign(value=1)
>>> tags.pivot_table(index='line', columns='tag', values='value', fill_value=0).join(df[['issue.id', 'description_summary']])
Temperature Toy issue.id description_summary
line
0 1 1 753.0 Long sentence with keywords ball and hot
1 1 1 937.0 Long sentence with keywords cold, stick, and g...
4 1 0 574.0 Long sentence with keywords very cold and cold
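Putting the pieces together, the following is one possible end-to-end sketch rather than a definitive solution. It rebuilds the question's sample data by hand (the blank row is assumed to load as missing values), uses a whole-word pattern, and swaps the join around so that the original columns come first. The inner join drops every row without a match, which covers both the blank row and the sentence with no keywords.

```python
import re
import pandas as pd

dict_list = {'Hot': 'Temperature', 'Cold': 'Temperature', 'Very cold': 'Temperature',
             'Ball': 'Toy', 'Glove': 'Toy', 'Stick': 'Toy'}

# Sample data reconstructed from the question; row 2 is assumed to be blank (NaN).
df = pd.DataFrame({
    'issue.id': [753, 937, None, 598, 574],
    'description_summary': [
        'Long sentence with keywords ball and hot',
        'Long sentence with keywords cold, stick, and glove',
        None,
        'Long sentence with NO keywords',
        'Long sentence with keywords very cold and cold',
    ],
})

# Whole-word, case-insensitive pattern with longer keywords first.
keywords = sorted(dict_list, key=len, reverse=True)
pattern = r'\b(' + '|'.join(re.escape(kw) for kw in keywords) + r')\b'
lookup = {kw.lower(): tag for kw, tag in dict_list.items()}

matches = df['description_summary'].str.extractall(pattern, flags=re.IGNORECASE)[0]
tags = (matches.str.lower().map(lookup)
        .rename_axis(['line', 'match'])
        .reset_index(name='tag')
        .assign(value=1))

# One row per original line, one 0/1 column per tag.
flags = (tags.pivot_table(index='line', columns='tag', values='value', fill_value=0)
         .astype(int))

# Inner join keeps only the lines that had at least one keyword.
result = df[['issue.id', 'description_summary']].join(flags, how='inner')
result['issue.id'] = result['issue.id'].astype(int)
print(result)
```

The explicit astype(int) calls are there because the blank row forces issue.id to float and because the dtype returned by pivot_table can differ between pandas versions.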