Tags: python, string, indexing, tuples, tokenize

How do you reconcile a list of tuples containing a tokenized string with the original string?


I am trying to reconcile idx_tag_token, a list of tuples that each hold a token's character index, its label, and the token text, with the original string word_string. I want to output a list of tuples, each pairing one whitespace-delimited word of the original string with the label information from idx_tag_token.

I have written some code that uses each token's character index to find the word it belongs to in word_string. I then build a list of tuples, word_tag_list, pairing each of these words with the token's label. However, I am unsure how to proceed from there to the desired output.

The conditions to update the labels are not complicated, but I can't work out the appropriate system here.

Any assistance would be truly appreciated.

The data:

word_string = "At London, the 12th in February, 1942, and for that that reason Mark's (3) wins, American parts"

idx_tag_token = [(0, 'O', 'At'),
                 (3, 'GPE-B', 'London'),
                 (9, 'O', ','),
                 (11, 'DATE-B', 'the'),
                 (15, 'DATE-I', '12th'),
                 (20, 'O', 'in'),
                 (23, 'DATE-B', 'February'),
                 (31, 'DATE-I', ','),
                 (33, 'DATE-I', '1942'),
                 (37, 'O', ','),
                 (39, 'O', 'and'),
                 (43, 'O', 'for'),
                 (47, 'O', 'that'),
                 (52, 'O', 'that'),
                 (57, 'O', 'reason'),
                 (64, 'PERSON-B', 'Mark'),
                 (68, 'O', "'s"),
                 (71, 'O', '('),
                 (72, 'O', '3'),
                 (73, 'O', ')'),
                 (75, 'O', 'wins'),
                 (79, 'O', ','),
                 (81, 'NORP-B', 'American'),
                 (90, 'O', 'parts')]

My code:

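def find_word_from_index(idx, word_string):
    # Return the whitespace-delimited word that covers character offset idx.
    # Assumes words are separated by exactly one space.
    words = word_string.split()
    current_index = 0

    for word in words:
        start_index = current_index
        end_index = current_index + len(word) - 1
        if start_index <= idx <= end_index:
            return word
        # Skip past the word's last character and the single space after it.
        current_index = end_index + 2
    return None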


word_tag_list = []
for index, tag, _ in idx_tag_token:
    word = find_word_from_index(index, word_string)
    word_tag_list.append((word, tag))
word_tag_list

Current output:

[('At', 'O'),
 ('London,', 'GPE-B'),
 ('London,', 'O'),
 ('the', 'DATE-B'),
 ('12th', 'DATE-I'),
 ('in', 'O'),
 ('February,', 'DATE-B'),
 ('February,', 'DATE-I'),
 ('1942,', 'DATE-I'),
 ('1942,', 'O'),
 ('and', 'O'),
 ('for', 'O'),
 ('that', 'O'),
 ('that', 'O'),
 ('reason', 'O'),
 ("Mark's", 'PERSON-B'),
 ("Mark's", 'O'),
 ('(3)', 'O'),
 ('(3)', 'O'),
 ('(3)', 'O'),
 ('wins,', 'O'),
 ('wins,', 'O'),
 ('American', 'NORP-B'),
 ('parts', 'O')]

Desired output:

[('At', 'O'),
 ('London,', 'GPE-B'),
 ('the', 'DATE-B'),
 ('12th', 'DATE-I'),
 ('in', 'O'),
 ('February,', 'DATE-B'),
 ('1942,', 'DATE-I'),
 ('and', 'O'),
 ('for', 'O'),
 ('that', 'O'),
 ('that', 'O'),
 ('reason', 'O'),
 ("Mark's", 'PERSON-B'),
 ('(3)', 'O'),
 ('wins,', 'O'),
 ('American', 'NORP-B'),
 ('parts', 'O')]

Solution

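  • Try a generator-based coroutine. It concatenates consecutive tokens until they spell the next whitespace-delimited word, then yields that word with the tag of its first token: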

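    def get_tokens(tokens):
        # Coroutine: receives one word per send() and yields (word, tag).
        it = iter(tokens)
        _, token_type, next_token = next(it)
        word = yield  # priming yield; the first send() supplies the first word
        while True:
            if next_token == word:
                # The accumulated token text equals the word: emit it with
                # the tag of the word's first token, then start the next word.
                word = yield next_token, token_type
                _, token_type, next_token = next(it)
            else:
                # The word spans several tokens: append the next token's text.
                _, _, tmp = next(it)
                next_token += tmp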
    
    it = get_tokens(idx_tag_token)
    next(it)
    out = [it.send(w) for w in word_string.split()]
    
    print(out)
    

    Prints:

    [
        ("At", "O"),
        ("London,", "GPE-B"),
        ("the", "DATE-B"),
        ("12th", "DATE-I"),
        ("in", "O"),
        ("February,", "DATE-B"),
        ("1942,", "DATE-I"),
        ("and", "O"),
        ("for", "O"),
        ("that", "O"),
        ("that", "O"),
        ("reason", "O"),
        ("Mark's", "PERSON-B"),
        ("(3)", "O"),
        ("wins,", "O"),
        ("American", "NORP-B"),
        ("parts", "O"),
    ]
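
An alternative is to align on character offsets instead of token text. The sketch below is not part of the answer above. It assumes that the first element of each tuple is the token's start offset in word_string (as in the question's data), that a word should take the tag of the token starting at the word's first character, and that a word with no such token falls back to 'O'. The name align_tags_to_words is hypothetical.

```python
import re


def align_tags_to_words(text, idx_tag_token):
    """Pair each whitespace-delimited word with the tag of its first token."""
    # Map every token's start offset to its tag.
    tag_at = {start: tag for start, tag, _ in idx_tag_token}

    result = []
    # re.finditer reports the real start offset of each word, so runs of
    # several spaces do not shift the alignment.
    for match in re.finditer(r"\S+", text):
        # Assumption: fall back to 'O' when no token starts at the word start.
        tag = tag_at.get(match.start(), "O")
        result.append((match.group(), tag))
    return result


sample_text = "At London, the 12th"
sample_tokens = [
    (0, "O", "At"),
    (3, "GPE-B", "London"),
    (9, "O", ","),
    (11, "DATE-B", "the"),
    (15, "DATE-I", "12th"),
]
print(align_tags_to_words(sample_text, sample_tokens))
# [('At', 'O'), ('London,', 'GPE-B'), ('the', 'DATE-B'), ('12th', 'DATE-I')]
```

Because this version looks tags up by offset, it does not require the concatenated token text to equal the word exactly. With the generator above, a mismatch keeps consuming tokens until the input is exhausted, which ends in a RuntimeError on Python 3.7 and later.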