I'm using Torchtext for some NLP tasks, specifically using the built-in embeddings.
I want to be able to do a inverse vector search: Generate a noisy vector, find the vector that is closest to it, then get back the word that is "closest" to the noisy vector.
From the torchtext docs, here's how to attach embeddings to a built-in dataset:
from torchtext.vocab import GloVe
from torchtext import data
embedding = GloVe(name='6B', dim=100)
# Set up fields
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False, is_target=True)
# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)
# build the vocabulary
TEXT.build_vocab(train, vectors=embedding, max_size=100000)
LABEL.build_vocab(train)
# Get an example vector
embedding.get_vecs_by_tokens("germany")
Then we can build the annoy index:
from annoy import AnnoyIndex
num_trees = 50
ann_index = AnnoyIndex(embedding_dims, 'angular')
# Iterate through each vector in the embedding and add it to the index
for vector_num, vector in enumerate(TEXT.vocab.vectors):
ann_index.add_item(vector_num, vector) # Here's the catch: will vector_num correspond to torchtext.vocab.Vocab.itos?
ann_index.build(num_trees)
Then say I want to retrieve a word using a noisy vector:
# Get an existing vector
original_vec = embedding.get_vecs_by_tokens("germany")
# Add some noise to it
noise = generate_noise_vector(ndims=100)
noisy_vector = original_vec + noise
# Get the vector closest to the noisy vector
closest_item_idx = ann_index.get_nns_by_vector(noisy_vector, 1)[0]
# Get word from noisy item
noisy_word = TEXT.vocab.itos[closest_item_idx]
My question comes in for the last two lines above: The ann_index
was built using enumerate
over the embedding
object, which is a Torch tensor.
The [vocab][2]
object has its own itos
list that given an index returns a word.
My question is this: Can I be certain that the order in which words appear in the itos list, is the same as the order in TEXT.vocab.vectors
? How can I map one index to the other?
Can I be certain that the order in which words appear in the
itos
list, is the same as the order inTEXT.vocab.vectors
?
Yes.
The Field
class will always instantiate a Vocab
object (source), and since you are passing the pre-trained vectors to TEXT.build_vocab
, the Vocab
constructor will call the load_vectors
function.
if vectors is not None:
self.load_vectors(vectors, unk_init=unk_init, cache=vectors_cache)
In the load_vectors
, the vectors
are filled by enumerating the words in the itos
.
for i, token in enumerate(self.itos):
start_dim = 0
for v in vectors:
end_dim = start_dim + v.dim
self.vectors[i][start_dim:end_dim] = v[token.strip()]
start_dim = end_dim
assert(start_dim == tot_dim)
Therefore, you can be certain that itos
and vectors
will have the same order.