For a BIO tagging problem, I'm looking for a way to find the indices of specific words in a list of strings.
For example:
text = "Britain has reduced its carbon emissions more than any rich country"
word = 'rich'
print(text.split())
['Britain', 'has', 'reduced', 'its', 'carbon', 'emissions', 'more', 'than', 'any', 'rich', 'country']
text.split(' ').index(word) # returns 9
text.split(' ').index('rich country') # raises ValueError, as expected
My desired answer would be:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
I think I can just use a loop to find the index of the first word and the index of the last word, and then set the positions in between to 1 and everything else to 0.
However, my question is: what if the text list looked like this:
['Britain', 'has', 'reduced', 'its', 'carbon', 'emissions', 'more', 'than', 'any', 'rich', 'count', '_ry']
or maybe
['Britain', 'has', 'reduced', 'its', 'carbon', 'emissions', 'more', 'than', 'any', 'richcountry']
I believe I can solve this with some messy for loops, but I suspect there is a cleaner and simpler way to do it.
I would appreciate any advice on this problem.
Thanks in advance!
In answer to your first question:
text = "Britain has reduced its carbon emissions more than any rich country"
words = 'rich country'.split(" ")
split_text = text.split()
[1 if x in words else 0 for x in split_text]
Output:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
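One caveat: the list comprehension above tags every token that equals either word, even when the two words are not next to each other. If only the contiguous phrase should be tagged, a sliding-window comparison is one option. This is a minimal sketch; `tag_phrase` is a hypothetical helper name, not part of any library.

```python
def tag_phrase(tokens, phrase_tokens):
    """Return a 0/1 list marking tokens that belong to a contiguous phrase match."""
    tags = [0] * len(tokens)
    n = len(phrase_tokens)
    if n == 0:
        return tags
    # Slide a window of the phrase length over the token list.
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == phrase_tokens:
            tags[start:start + n] = [1] * n
    return tags


text = "Britain has reduced its carbon emissions more than any rich country"
print(tag_phrase(text.split(), "rich country".split()))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
```

With this version, a sentence such as "a rich man in a rich country" only gets the last two tokens tagged, because the first "rich" is not followed by "country".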
The second issue will require fuzzy matching, which can be achieved with fuzzywuzzy:
from fuzzywuzzy import process
words = 'rich country'.split(" ")
split_text = ['Britain', 'has', 'reduced', 'its', 'carbon', 'emissions', 'more', 'than', 'any', 'richcountry']
[1 if process.extractBests(x, words, score_cutoff=60) else 0 for x in split_text]
Output:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
And for
split_text = ['Britain', 'has', 'reduced', 'its', 'carbon', 'emissions', 'more', 'than', 'any', 'rich', 'count', '_ry']
[1 if process.extractBests(x, words, score_cutoff=60) else 0 for x in split_text]
Output:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
Note that you can set a threshold value with score_cutoff.
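If you would rather avoid a fuzzy threshold, another option for the split and merged token cases is to match on character offsets: join the tokens into one string, find the phrase with its spaces removed, and tag every token whose character span overlaps the match. The sketch below uses only the standard library. It assumes `_` is a subword continuation marker (as in `'_ry'`) that can simply be stripped, and `tag_by_char_span` is a hypothetical helper name.

```python
def tag_by_char_span(tokens, phrase, marker="_"):
    """Tag tokens whose characters overlap an occurrence of `phrase`.

    Assumption: `marker` is a subword continuation symbol that carries
    no text of its own, so it is removed before matching.
    """
    cleaned = [tok.replace(marker, "") for tok in tokens]
    joined = "".join(cleaned)
    target = "".join(phrase.split())  # phrase with whitespace removed
    tags = [0] * len(tokens)
    if not target:
        return tags

    # Record the (start, end) character span of each token in `joined`.
    spans = []
    pos = 0
    for tok in cleaned:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)

    # Tag every token that overlaps any occurrence of the target.
    start = joined.find(target)
    while start != -1:
        end = start + len(target)
        for i, (s, e) in enumerate(spans):
            if s < end and e > start:
                tags[i] = 1
        start = joined.find(target, start + 1)
    return tags


prefix = ['Britain', 'has', 'reduced', 'its', 'carbon', 'emissions', 'more', 'than', 'any']
print(tag_by_char_span(prefix + ['rich', 'count', '_ry'], 'rich country'))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(tag_by_char_span(prefix + ['richcountry'], 'rich country'))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```

Because word boundaries are ignored, this can over-match: a token like 'ostrich' followed by 'country' would also be tagged. Whether that trade-off is acceptable depends on your data.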