Given a list of tokens as input:
>>> tokenized_text = "[CLS] my dog is cute [SEP] he likes slack ##ing [SEP]".split()
>>> tokenized_text
['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'slack', '##ing', '[SEP]']
The goal is to create a segment index for every token: scan from left to right, find the [SEP] tokens, and increment the index by 1 after each [SEP]. So the desired output indices for the tokenized_text list above are:
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
I've tried:
>>> import numpy as np
# Find the indices of `[SEP]`.
>>> sep_indices = np.where(np.array(tokenized_text) == "[SEP]")[0]
>>> sep_indices
array([ 5, 10])
>>> prev = 0
>>> out =[]
>>> for i, idx in enumerate(sep_indices):
... for _ in range(idx-prev):
... out.append(i)
... prev = idx
...
>>> out = [0] + out  # prepend 0 to shift the increment past each [SEP]; the final [SEP] keeps the last id
>>> out
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
But is there an easier way to achieve the correct output?
An easier, vectorized way with NumPy -
In [116]: a = np.asarray(tokenized_text)
In [117]: m = a == "[SEP]"
In [118]: m.cumsum()-m
Out[118]: array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
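This works because m is a boolean mask of the [SEP] positions: m.cumsum() counts how many [SEP] tokens have been seen up to and including each position, and subtracting m makes each [SEP] keep the id of its own segment, so the increment only takes effect on the token after it.

If you'd rather avoid NumPy, the same cumsum-minus-mask trick works in plain Python with itertools.accumulate. This is just a minimal sketch; segment_ids is a hypothetical helper name, not part of any library:

>>> from itertools import accumulate
>>> def segment_ids(tokens, sep="[SEP]"):
...     # 1 at each `sep` position, 0 elsewhere.
...     marks = [int(t == sep) for t in tokens]
...     # Running count of seps minus the current mark, so each
...     # `sep` stays in its own segment.
...     return [c - m for c, m in zip(accumulate(marks), marks)]
...
>>> segment_ids(tokenized_text)
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]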