Tags: python, list, nlp, python-re

Finding the index of words in a list of words


For a BIO tagging problem, I'm looking for a way to find the indices of specific words in a list of strings.

For example:

text = "Britain has reduced its carbon emissions more than any rich country"
word = 'rich'
print(text.split())
['Britain', 'has', 'reduced', 'its', 'carbon', 'emissions', 'more', 'than', 'any', 'rich', 'country']

text.split(' ').index(word) # returns 9

text.split(' ').index('rich country') # raises ValueError, as expected

My desired answer would be:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

I think I can just use a loop to find the index of the first word and of the last word, and then set those positions to 1 and all others to 0.

However, my question is: what if the token list looks like this?

['Britain', 'has', 'reduced', 'its', 'carbon', 'emissions', 'more', 'than', 'any', 'rich', 'count', '_ry']

or maybe

['Britain', 'has', 'reduced', 'its', 'carbon', 'emissions', 'more', 'than', 'any', 'richcountry']

I believe I can solve this with messy for loops, but there is probably a cleaner and simpler way to do it.

I would appreciate any advice on this problem.

Thanks in advance!


Solution

  • In answer to your first question:

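    text = "Britain has reduced its carbon emissions more than any rich country"
    words = 'rich country'.split()
    split_text = text.split()
    # Flag every token that equals one of the target words
    # (order and adjacency of the words are not checked)
    print([1 if x in words else 0 for x in split_text])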
    

    Output:

    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
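
The membership test above flags any token equal to 'rich' or 'country', wherever it appears. If only the contiguous phrase should be tagged, a sliding-window comparison is one option. The following is a minimal sketch; the helper names `phrase_mask` and `mask_to_bio` are hypothetical, and the BIO conversion assumes that two occurrences of the phrase never sit directly next to each other.

```python
def phrase_mask(tokens, phrase_tokens):
    """Return a 0/1 mask marking every contiguous occurrence of phrase_tokens."""
    mask = [0] * len(tokens)
    n = len(phrase_tokens)
    if n == 0:
        return mask
    # Slide a window of length n over the tokens and compare slices.
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == phrase_tokens:
            mask[start:start + n] = [1] * n
    return mask


def mask_to_bio(mask):
    """Convert a 0/1 mask to BIO tags: the first 1 of each run is B, the rest I."""
    tags = []
    previous = 0
    for flag in mask:
        if flag == 0:
            tags.append("O")
        elif previous == 0:
            tags.append("B")
        else:
            tags.append("I")
        previous = flag
    return tags


text = "Britain has reduced its carbon emissions more than any rich country"
mask = phrase_mask(text.split(), "rich country".split())
print(mask)               # the last two positions are 1, the rest 0
print(mask_to_bio(mask))  # nine "O" tags followed by "B" and "I"
```

Unlike the list comprehension, this leaves a lone 'rich' or 'country' elsewhere in the sentence untagged.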
    

    The second case requires fuzzy matching, which can be done with fuzzywuzzy:

    from fuzzywuzzy import process
    words = 'rich country'.split()
    split_text = ['Britain', 'has', 'reduced', 'its', 'carbon', 'emissions', 'more', 'than', 'any', 'richcountry']
    # extractBests returns only the matches scoring at least score_cutoff;
    # an empty result is falsy, so the token gets 0
    print([1 if process.extractBests(x, words, score_cutoff=60) else 0 for x in split_text])
    

    Output:

    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    

    And for the token list in which the word is split:

    split_text = ['Britain', 'has', 'reduced', 'its', 'carbon', 'emissions', 'more', 'than', 'any', 'rich', 'count', '_ry']
    # Reuses `process` and `words` from the previous snippet
    print([1 if process.extractBests(x, words, score_cutoff=60) else 0 for x in split_text])
    

    Output:

    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
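
How strongly the result depends on the similarity threshold can be illustrated with the standard library alone. The sketch below uses `difflib.SequenceMatcher` as a stand-in, not fuzzywuzzy's own scorer: its ratio lies between 0 and 1 and is computed differently, so the numbers are not comparable to fuzzywuzzy scores. The helper name `fuzzy_mask` is hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_mask(tokens, words, cutoff):
    """Flag tokens whose best similarity ratio to any target word is >= cutoff."""
    mask = []
    for token in tokens:
        best = max(
            SequenceMatcher(None, token.lower(), word.lower()).ratio()
            for word in words
        )
        mask.append(1 if best >= cutoff else 0)
    return mask


words = "rich country".split()
tokens = ['Britain', 'has', 'reduced', 'its', 'carbon', 'emissions',
          'more', 'than', 'any', 'rich', 'count', '_ry']

print(fuzzy_mask(tokens, words, cutoff=0.6))  # strict threshold
print(fuzzy_mask(tokens, words, cutoff=0.3))  # loose threshold
```

With this scorer, a cutoff of 0.6 flags 'rich' and 'count' but misses the short fragment '_ry', while a cutoff of 0.3 picks up '_ry' at the price of false positives such as 'Britain'. Short fragments are the hard case for any threshold-based approach.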
    

    Note that score_cutoff sets the minimum similarity score (from 0 to 100) a match must reach; raise or lower it to make the matching stricter or looser.
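
If the tokens are known to be pieces of the original text (merged or split, but otherwise unchanged), a deterministic alternative to fuzzy scoring is to align by character offsets: join the tokens without spaces, locate the phrase with its spaces removed, and flag every token that overlaps the hit. This is only a sketch under two assumptions that are not stated in the question: a leading '_' is purely a subword continuation marker, and matching is case-insensitive. The helper name `span_mask` is hypothetical.

```python
def span_mask(tokens, phrase, marker="_"):
    """Mark tokens that overlap the phrase once spaces and subword markers are ignored."""
    # Assumption: a leading `marker` only signals a subword continuation.
    pieces = [token.lstrip(marker).lower() for token in tokens]
    joined = "".join(pieces)
    target = "".join(phrase.lower().split())

    # Character interval [begin, end) that each token occupies in `joined`.
    bounds = []
    position = 0
    for piece in pieces:
        bounds.append((position, position + len(piece)))
        position += len(piece)

    mask = [0] * len(tokens)
    if not target:
        return mask
    start = joined.find(target)
    while start != -1:
        end = start + len(target)
        for i, (begin, stop) in enumerate(bounds):
            if begin < end and stop > start:  # the two intervals overlap
                mask[i] = 1
        start = joined.find(target, start + 1)
    return mask


base = ['Britain', 'has', 'reduced', 'its', 'carbon', 'emissions', 'more', 'than', 'any']
print(span_mask(base + ['rich', 'country'], 'rich country'))
print(span_mask(base + ['rich', 'count', '_ry'], 'rich country'))
print(span_mask(base + ['richcountry'], 'rich country'))
```

Because spaces are ignored, the phrase can also match across unrelated word boundaries (for example inside 'enrich country'), so a check on token boundaries may be needed for real data.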