Tags: python, string, iteration, generator, tokenize

How do you add a condition to a generator function based on the previous output?


I am trying to reconcile id_label_token, a list of tuples that each hold a character index, a label, and a token, with the original string, string.

I have some working code that uses a generator to do this. However, it can't handle instances where there is a label for a token between parentheses in the original string. I am new to generators, and I am finding it difficult to implement a condition that produces my desired output.

How could I check the previous token_type against the current token_type so that I don't skip assigning a token_type when a token is between parentheses?

Some help would be appreciated. And if this question requires different framing, please don't hesitate to say so.

Data:

id_label_token = [(0, 'O', '('),
                  (1, 'DATE-B', '6'),
                  (2, 'DATE-I', ')'),
                  (4, 'DATE-I', '13th'),
                  (9, 'DATE-B', 'February'),
                  (18, 'DATE-I', '1942'),
                  (23, 'O', '('),
                  (24, 'GPE-B', 'N.S.'),
                  (28, 'O', ')')]

string = "(6) 13th February 1942 (N.S.)"

Current code:

def get_tokens(tokens):
    it = iter(tokens)
    _, token_type, next_token = next(it)
    word = yield
    while True:
        if next_token == word:
            word = yield next_token, token_type
            _, token_type, next_token = next(it)
        else:
            _, _, tmp = next(it)
            next_token += tmp

it = get_tokens(id_label_token)
next(it)
out = [it.send(w) for w in string.split()]
print(out)

Current output:

[('(6)', 'O'), ('13th', 'DATE-I'), ('February', 'DATE-B'), ('1942', 'DATE-I'), ('(N.S.)', 'O')]

Desired output:

[('(6)', 'DATE-B'), ('13th', 'DATE-I'), ('February', 'DATE-B'), ('1942', 'DATE-I'), ('(N.S.)', 'GPE-B')]

Solution

  • When you extend next_token with the next piece, also update token_type, unless the piece you just appended is the closing ). That way the rebuilt word takes the label of the token inside the parentheses rather than the label of the opening ( or the closing ):

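    def get_tokens(tokens):
        it = iter(tokens)
        _, token_type, next_token = next(it)
        word = yield  # primed with next(); the first send() delivers the first word
        while True:
            if next_token == word:
                # the rebuilt token matches the word: emit it with its label
                word = yield next_token, token_type
                _, token_type, next_token = next(it)
            else:
                # no match yet: append the next piece to the token being rebuilt
                _, new_tt, tmp = next(it)
                next_token += tmp
                if tmp != ')':  # keep the label of the inner token, not of the closing ')'
                    token_type = new_tt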
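
To check the fix against the data from the question, here is a minimal self-contained sketch. It repeats the corrected generator and wraps the priming and send() calls in a small helper; the name label_words is hypothetical and not part of the original answer. It assumes, as in the question's data, that ) is the only piece whose label should be ignored.

```python
def get_tokens(tokens):
    """Generator that receives a word via send() and yields (word, label)."""
    it = iter(tokens)
    _, token_type, next_token = next(it)
    word = yield  # the first send() delivers the first word
    while True:
        if next_token == word:
            word = yield next_token, token_type
            _, token_type, next_token = next(it)
        else:
            _, new_tt, tmp = next(it)
            next_token += tmp
            if tmp != ')':  # keep the inner token's label, skip the closing ')'
                token_type = new_tt


def label_words(tokens, text):
    """Hypothetical helper: prime the generator, then feed it each word."""
    gen = get_tokens(tokens)
    next(gen)  # advance to the first bare `yield`
    return [gen.send(word) for word in text.split()]


id_label_token = [(0, 'O', '('),
                  (1, 'DATE-B', '6'),
                  (2, 'DATE-I', ')'),
                  (4, 'DATE-I', '13th'),
                  (9, 'DATE-B', 'February'),
                  (18, 'DATE-I', '1942'),
                  (23, 'O', '('),
                  (24, 'GPE-B', 'N.S.'),
                  (28, 'O', ')')]

string = "(6) 13th February 1942 (N.S.)"

out = label_words(id_label_token, string)
print(out)
# [('(6)', 'DATE-B'), ('13th', 'DATE-I'), ('February', 'DATE-B'),
#  ('1942', 'DATE-I'), ('(N.S.)', 'GPE-B')]
```

One caveat: if a word can never be rebuilt from consecutive tokens, the generator runs out of tokens, and on Python 3.7 and later the StopIteration raised by next(it) inside the generator surfaces as a RuntimeError.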