I am trying to reconcile id_label_token, a list of tuples each containing a character index, a label, and a token, with the original string, string.
I have some working code that uses a generator to do this. However, it can't handle instances where there is a label for a token between parentheses in the original string. I am new to generators, and I am finding it difficult to implement a condition that produces my desired output.
How could I check the previous token_type against the current token_type so that I don't skip assigning a token_type when a token is between parentheses?
Some help would be appreciated. And if this question requires different framing, please don't hesitate to say so.
Data:
id_label_token = [(0, 'O', '('),
                  (1, 'DATE-B', '6'),
                  (2, 'DATE-I', ')'),
                  (4, 'DATE-I', '13th'),
                  (9, 'DATE-B', 'February'),
                  (18, 'DATE-I', '1942'),
                  (23, 'O', '('),
                  (24, 'GPE-B', 'N.S.'),
                  (28, 'O', ')')]
string = "(6) 13th February 1942 (N.S.)"
Current code:
def get_tokens(tokens):
    it = iter(tokens)
    _, token_type, next_token = next(it)
    word = yield
    while True:
        if next_token == word:
            word = yield next_token, token_type
            _, token_type, next_token = next(it)
        else:
            _, _, tmp = next(it)
            next_token += tmp

it = get_tokens(id_label_token)
next(it)
out = [it.send(w) for w in string.split()]
print(out)
Current output:
[('(6)', 'O'), ('13th', 'DATE-I'), ('February', 'DATE-B'), ('1942', 'DATE-I'), ('(N.S.)', 'O')]
Desired output:
[('(6)', 'DATE-B'), ('13th', 'DATE-I'), ('February', 'DATE-B'), ('1942', 'DATE-I'), ('(N.S.)', 'GPE-B')]
When you update next_token, you should also update token_type once you reach the 'heart' of the parentheses. The only check you need is whether the piece you just appended is the closing ) (or rather, that it is NOT the closing parenthesis):
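def get_tokens(tokens):
    it = iter(tokens)
    _, token_type, next_token = next(it)
    word = yield
    while True:
        if next_token == word:
            # The accumulated token matches the word: emit it with its label.
            word = yield next_token, token_type
            _, token_type, next_token = next(it)
        else:
            # Keep gluing pieces together until they form the whole word.
            _, new_tt, tmp = next(it)
            next_token += tmp
            # Take the label of the new piece unless it is the closing ')'.
            if tmp != ')':
                token_type = new_tt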
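For comparison, here is a minimal sketch of a non-generator alternative that relies on the character indices already stored in id_label_token. The function name label_words, the choice to ignore bare parenthesis tokens, and the fallback to 'O' are assumptions made for illustration; they are not part of the original code and may need adjusting for other data.

```python
id_label_token = [(0, 'O', '('), (1, 'DATE-B', '6'), (2, 'DATE-I', ')'),
                  (4, 'DATE-I', '13th'), (9, 'DATE-B', 'February'),
                  (18, 'DATE-I', '1942'), (23, 'O', '('),
                  (24, 'GPE-B', 'N.S.'), (28, 'O', ')')]
string = "(6) 13th February 1942 (N.S.)"


def label_words(text, tokens):
    """Pair each whitespace-separated word with a label from its tokens.

    Hypothetical helper: a token belongs to a word when its character
    index falls inside the word's span in `text`.
    """
    labelled = []
    search_from = 0
    for word in text.split():
        # Locate the word in the original text to get its character span.
        start = text.index(word, search_from)
        end = start + len(word)
        search_from = end
        # Collect labels of tokens inside the span, skipping bare parentheses
        # (assumption: their labels should never decide the word's label).
        labels = [label for offset, label, token in tokens
                  if start <= offset < end and token not in ('(', ')')]
        # Use the last remaining label; fall back to 'O' if none is left.
        labelled.append((word, labels[-1] if labels else 'O'))
    return labelled


out = label_words(string, id_label_token)
print(out)
```

On the sample data this should reproduce the desired output from the question. Unlike the generator, it does not depend on the token pieces concatenating exactly to each word, only on the indices being correct.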