Tags: python, tensorflow, deep-learning, tokenize, bert-language-model

bert_vocab.bert_vocab_from_dataset returning wrong vocabulary


I'm trying to build a tokenizer following the TensorFlow subwords tokenizer tutorial (https://www.tensorflow.org/text/guide/subwords_tokenizer). I'm basically doing the same thing, only with a different dataset. The dataset in question is a txt file whose first two columns are an English sentence (or word) and its Italian translation; here is a snippet:

Hi. Ciao!   CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #607364 (Cero)
Hi. Ciao.   CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #4522287 (Guybrush88)
Run!    Corri!  CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906347 (Guybrush88)
Run!    Corra!  CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906348 (Guybrush88)
Run!    Correte!    CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #906350 (Guybrush88)
Who?    Chi?    CC-BY 2.0 (France) Attribution: tatoeba.org #2083030 (CK) & #2126402 (Guybrush88)

It can be downloaded at http://www.manythings.org/anki/

I've preprocessed it and turned the English and Italian sentences into TensorFlow datasets to be fed to the tokenizer, as shown in this code:

import tensorflow as tf
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab
import tensorflow_text as tf_text
import os
import numpy as np

eng_dataset, ita_dataset = np.genfromtxt('ita_eng_dataset.txt',
                                         usecols=(0, 1),
                                         encoding='utf-8',
                                         unpack=True,
                                         dtype='str')

eng_dataset_tensor = tf.convert_to_tensor(eng_dataset)
ita_dataset_tensor = tf.convert_to_tensor(ita_dataset)

eng_tf_dataset = tf.data.Dataset.from_tensor_slices(eng_dataset_tensor)
ita_tf_dataset = tf.data.Dataset.from_tensor_slices(ita_dataset_tensor)

The problems arise when I try to feed them to bert_vocab_from_dataset:

bert_tokenizer_params = dict(lower_case=True)
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

bert_vocab_args = dict(
    # The target vocabulary size
    vocab_size=8000,
    # Reserved tokens that must be included in the vocabulary
    reserved_tokens=reserved_tokens,
    # Arguments for `text.BertTokenizer`
    bert_tokenizer_params=bert_tokenizer_params,
    # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`
    learn_params={},
)

eng_vocab = bert_vocab.bert_vocab_from_dataset(eng_tf_dataset, **bert_vocab_args)
ita_vocab = bert_vocab.bert_vocab_from_dataset(ita_tf_dataset, **bert_vocab_args)

but the results are wrong:

print(eng_vocab[:20])
print(ita_vocab[1980:2000])
print(len(eng_vocab), len(ita_vocab))

which outputs

['about', 'breakfast', 'coffee', 'correct', 'finally', 'heat', 'japanese', 'large', 'lie', 'old', 'peel', 'science', 'step', 'swimming', 'work', '##ans', '##b', '##der', '##ins', '##ish']
['##omfortable', '##ong', '##ony', '##op', '##ouse', '##ply', '##rch', '##rous', '##rove', '##roved', '##sists', '##tained', '##ten', '##unted', '##val', '##ze', 'advice', 'agitated', 'amazed', 'argued']
665 2413

As you can see, the Italian vocabulary contains English text, and both vocabularies are very small (that may be due to the dataset, but fewer than 1,000 entries for English seems odd).
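
A quick way to see what the tokenizer actually receives is to print a few of the parsed pairs (a hypothetical check, using the eng_dataset and ita_dataset arrays defined above):

# Hypothetical sanity check: inspect the first few parsed English/Italian pairs.
# If the columns are split on the wrong delimiter, the second column may contain
# English words instead of the Italian translation.
for eng, ita in zip(eng_dataset[:5], ita_dataset[:5]):
    print(repr(eng), '->', repr(ita))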

I also tried batching the input datasets as in the TensorFlow tutorial (roughly as sketched below), but it gave the same results.
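
# Rough sketch of the batching attempt, mirroring the tutorial's
# .batch(...).prefetch(...) pattern; it did not change the outcome.
eng_batched = eng_tf_dataset.batch(1000).prefetch(2)
ita_batched = ita_tf_dataset.batch(1000).prefetch(2)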

I'm using Python 3.8 in PyCharm on Windows 11, with TensorFlow 2.10.


Solution

  • Solved: it was just np.genfromtxt not using '\t' as the delimiter by default, so the columns were not split at the tabs.
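
A minimal sketch of the corrected loading call, assuming the same file and arguments as in the question, with only the tab delimiter passed explicitly:

import numpy as np

# Passing delimiter='\t' makes genfromtxt split on tabs instead of any whitespace,
# so column 0 is the English sentence and column 1 the Italian translation.
eng_dataset, ita_dataset = np.genfromtxt('ita_eng_dataset.txt',
                                         usecols=(0, 1),
                                         delimiter='\t',
                                         encoding='utf-8',
                                         unpack=True,
                                         dtype='str')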