python string list nltk tokenize

Untokenize specific words in a list


I have a list of strings and would like to untokenize (rejoin) some specific ones. Given the list below, I want to join the words "my" and "apple" only when they appear next to each other in that order. I was thinking of using the detokenize function from the "Python Untokenize a sentence" question. Here is some reproducible code:

target = "my apple"
words = ['this', 'is', 'my', 'apple', 'and', 'this', 'is', 'not', 'your', 'apple']

Using the detokenizer:

from nltk.tokenize.treebank import TreebankWordDetokenizer    
TreebankWordDetokenizer().detokenize(['my', 'apple'])
'my apple'

But I am not sure how to apply this to a longer list while joining only a specified target phrase. Here is the desired output:

target_output = ['this', 'is', 'my apple', 'and', 'this', 'is', 'not', 'your', 'apple']
['this', 'is', 'my apple', 'and', 'this', 'is', 'not', 'your', 'apple']

Does anyone know how to detokenize specific words only when they are next to each other in a list?


Solution

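  • A simple generator seems to be enough: walk the list and, whenever the next few tokens spell out the target phrase, emit the phrase as one item and skip past those tokens: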

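    def detokenize(sent, tgt):
        """Yield the tokens of sent, merging each run that spells out tgt."""
        i = 0
        tgt_len = len(tgt.split())  # this allows for phrases longer than 2
        while i < len(sent):
            # The tgt_len check guards against an empty target, which would
            # otherwise match forever without advancing i.
            if tgt_len and " ".join(sent[i:i+tgt_len]) == tgt:
                yield tgt
                i += tgt_len
            else:
                # No match starts here, so pass the token through unchanged.
                yield sent[i]
                i += 1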
    
    >>> list(detokenize(words, "my apple"))
    ['this', 'is', 'my apple', 'and', 'this', 'is', 'not', 'your', 'apple']
    >>> list(detokenize(words, "this is not"))
    ['this', 'is', 'my', 'apple', 'and', 'this is not', 'your', 'apple']