I have a list of strings and would like to untokenize (join) certain tokens in it. Given the list below, I want to join the words "my" and "apple" only when they appear next to each other in that order. I was thinking of using the detokenize function from the "Python Untokenize a sentence" question. Here is some reproducible code:
target = "my apple"
words = ['this', 'is', 'my', 'apple', 'and', 'this', 'is', 'not', 'your', 'apple']
Using the detokenizer:
from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['my', 'apple'])
'my apple'
But I am not sure how to apply this to a longer list of tokens while specifying a target phrase. Here is the desired output:
target_output = ['this', 'is', 'my apple', 'and', 'this', 'is', 'not', 'your', 'apple']
['this', 'is', 'my apple', 'and', 'this', 'is', 'not', 'your', 'apple']
So I was wondering: does anyone know how to detokenize specific words only when they are next to each other in a list?
The following seems simple enough:
def detokenize(sent, tgt):
    i = 0
    tgt_len = len(tgt.split())  # this allows for phrases longer than 2 words
    while i < len(sent):
        if " ".join(sent[i:i+tgt_len]) == tgt:
            yield tgt
            i += tgt_len
        else:
            yield sent[i]
            i += 1
>>> list(detokenize(words, "my apple"))
['this', 'is', 'my apple', 'and', 'this', 'is', 'not', 'your', 'apple']
>>> list(detokenize(words, "this is not"))
['this', 'is', 'my', 'apple', 'and', 'this is not', 'your', 'apple']
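One caveat with the string comparison above: a token that itself contains a space (for example 'a b') can produce a false match, and an empty target never advances the index, so the loop does not terminate. A possible variant, offered only as a sketch, compares the token slice against the target's words directly and passes the tokens through unchanged when the target is empty. The name merge_phrase is made up for this illustration and is not part of nltk.

```python
def merge_phrase(tokens, phrase):
    """Yield tokens, merging each consecutive run that equals the phrase's words."""
    parts = phrase.split()
    if not parts:
        # An empty phrase would never advance the index, so pass tokens through.
        yield from tokens
        return
    n = len(parts)
    i = 0
    while i < len(tokens):
        # Compare lists of tokens rather than joined strings.
        if tokens[i:i + n] == parts:
            yield " ".join(parts)
            i += n
        else:
            yield tokens[i]
            i += 1


words = ['this', 'is', 'my', 'apple', 'and', 'this', 'is', 'not', 'your', 'apple']
merged = list(merge_phrase(words, "my apple"))
print(merged)
```

For the example list this gives the same result as the generator above; the two only differ on the edge cases mentioned.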