Tags: python, tensorflow, nlp, tflearn

TFlearn - VocabularyProcessor ignores parts of given vocabulary


I am using the VocabularyProcessor of TFlearn to map documents to integer arrays. However, I don't seem to be able to initialize the VocabularyProcessor with my own vocabulary. The docs say that I can provide a vocabulary when creating the VocabularyProcessor, as in:

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length, vocabulary=vocab)

However, when I create the VocabularyProcessor like this, I cannot transform my documents correctly. I provide the vocabulary as a dictionary, with the word indices as values:

vocab={'hello':3, '.':5, 'world':20}

Sentences are provided as follows:

sentences = ['hello summer .', 'summer is here .', ...]

It's very important that the VocabularyProcessor uses the given indices to transform the documents, because each index references a certain word embedding. When calling

list(vocab_processor.transform(['hello world .', 'hello'])) 

the output is

[array([ 3, 20, 0]), array([3, 0, 0])]

So the sentences weren't transformed according to the provided vocabulary, which maps '.' to 5. How do I provide the vocabulary to the VocabularyProcessor correctly?
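
For context: the reason the exact indices matter is that they are later used as row indices into a pre-trained embedding matrix, roughly along these lines (the matrix below is just a random placeholder for illustration):

import numpy as np
import tensorflow as tf

# placeholder embedding matrix: row i holds the vector for word id i,
# so 'hello' must end up as id 3 and 'world' as id 20
embedding_matrix = tf.constant(np.random.rand(21, 50), dtype=tf.float32)

word_ids = tf.constant([3, 20, 5])                             # ids produced by VocabularyProcessor
embedded = tf.nn.embedding_lookup(embedding_matrix, word_ids)  # shape (3, 50)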


Solution

  • Let's run a small experiment to answer your question:

    vocab={'hello':3, '.':5, 'world':20, '/' : 10}
    sentences= ['hello world . / hello', 'hello']
    
    vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length=6, vocabulary=vocab)
    list(vocab_processor.transform(sentences)) 
    

    The output of this code segment is:

    [array([ 3, 20,  3,  0,  0,  0]), array([3, 0, 0, 0, 0, 0])]
    

    Now you can see that the dot ('.') and the slash ('/') are not picked up as tokens at all. So in your code, TensorFlow identifies only two words and pads an extra zero to reach max_document_length=3. To have these characters tokenized as well, you can write your own tokenizer function. A sample is given below.

    def my_func(iterator):
      return (x.split(" ") for x in iterator)
    
    vocab={'hello':3, '.':5, 'world':20, '/' : 10}
    sentences= ['hello world . / hello', 'hello']
    
    vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length=6, vocabulary=vocab, tokenizer_fn = my_func)
    
    list(vocab_processor.transform(sentences)) 
    

    Now the output of the code segment is:

    [array([ 3, 20,  5, 10,  3,  0]), array([3, 0, 0, 0, 0, 0])]
    

    which is the expected output. Hopefully this clears up the confusion.

    Your next question might be which values are tokenized by default. Here is the relevant part of the original source of the default tokenizer, so there is no room for confusion:

    TOKENIZER_RE = re.compile(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+",
                              re.UNICODE)
    def tokenizer(iterator):
      """Tokenizer generator.
      Args:
        iterator: Input iterator with strings.
      Yields:
        array of tokens per each value in the input.
      """
      for value in iterator:
        yield TOKENIZER_RE.findall(value)
    
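    To see concretely why '.' and '/' were dropped in the first experiment, you can run that default regex on the example sentence yourself; it should print only the word tokens:

    import re

    TOKENIZER_RE = re.compile(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+",
                              re.UNICODE)

    # '.' and '/' match none of the alternatives in the pattern, so they vanish
    print(TOKENIZER_RE.findall('hello world . / hello'))
    # ['hello', 'world', 'hello']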

    But my suggestion would be: "write your own function and be confident."
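
    For instance, one possible variant (just a sketch of my own, not something from the TFlearn source) also splits punctuation that is glued to a word, so that 'world.' becomes ['world', '.']:

    import re

    def my_tokenizer(iterator):
      # keep runs of word characters (plus ' and -) as tokens and emit every
      # other non-space character, e.g. '.' or '/', as its own token
      return (re.findall(r"[\w'\-]+|[^\w\s]", text) for text in iterator)

    It can be passed via tokenizer_fn exactly like my_func above.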

    Also, I would like to point out a few things in case you missed them (hopefully not). If you use the transform() function, the min_frequency argument has no effect, because transform() does not fit the data. You can see this in the following code:

    for i in range(6):
        vocab_processor = learn.preprocessing.VocabularyProcessor(
            max_document_length=7, min_frequency=i)
        tokens = vocab_processor.transform(["a b c d e f", "a b c d e", "a b c", "a b", "a"])
        print(list(tokens)[0])
    

    Output:

    [1 2 3 4 5 6 0]
    [1 2 3 4 5 6 0]
    [1 2 3 4 5 6 0]
    [1 2 3 4 5 6 0]
    [1 2 3 4 5 6 0]
    [1 2 3 4 5 6 0]
    

    Now compare with a slightly different snippet that uses fit_transform(). Because fit_transform() actually fits the data, min_frequency takes effect and infrequent words are trimmed from the vocabulary:

    for i in range(6):
        vocab_processor = learn.preprocessing.VocabularyProcessor(
            max_document_length=7, min_frequency=i)
        tokens = vocab_processor.fit_transform(["a b c d e f","a b c d e","a b c" , "a b", "a"])
        print(list(tokens)[0])
    

    Output:

    [1 2 3 4 5 6 0]
    [1 2 3 4 5 0 0]
    [1 2 3 0 0 0 0]
    [1 2 0 0 0 0 0]
    [1 0 0 0 0 0 0]
    [0 0 0 0 0 0 0]
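
    Finally, putting the pieces together for the vocabulary from your question, a minimal sketch (assuming the usual from tensorflow.contrib import learn import) that keeps your indices and tokenizes the dot is:

    from tensorflow.contrib import learn

    def my_func(iterator):
      return (x.split(" ") for x in iterator)

    vocab = {'hello': 3, '.': 5, 'world': 20}
    vocab_processor = learn.preprocessing.VocabularyProcessor(
        max_document_length=3, vocabulary=vocab, tokenizer_fn=my_func)

    print(list(vocab_processor.transform(['hello world .', 'hello'])))
    # expected: [array([ 3, 20,  5]), array([3, 0, 0])]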