
Problem with inputs when building a model with TFBertModel and AutoTokenizer from HuggingFace's transformers


I'm trying to build the model illustrated in this picture: [architecture diagram]

I obtained a pre-trained BERT and respective tokenizer from HuggingFace's transformers in the following way:

from transformers import AutoTokenizer, TFBertModel
model_name = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
bert = TFBertModel.from_pretrained(model_name)

The model will be fed a sequence of Italian tweets and will need to determine whether each one is ironic or not.

I'm having problems building the initial part of the model, which takes the inputs and feeds them to the tokenizer in order to get a representation I can feed to BERT.

I can do it outside of the model-building context:

my_phrase = "Ciao, come va?"
# an equivalent version is tokenizer(my_phrase, other parameters)
bert_input = tokenizer.encode(my_phrase, add_special_tokens=True, return_tensors='tf', max_length=110, padding='max_length', truncation=True) 
attention_mask = bert_input > 0
outputs = bert(bert_input, attention_mask)['pooler_output']
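The same call also works on a batch of phrases, which is closer to what the model will actually see. A quick sketch with made-up example tweets (the 768 below assumes the checkpoint uses the standard BERT-base hidden size):

my_phrases = ["Ciao, come va?", "Che bella giornata"]
batch_input = tokenizer(my_phrases, add_special_tokens=True, return_tensors='tf', max_length=110, padding='max_length', truncation=True)
# batch_input['input_ids'] and batch_input['attention_mask'] both have shape (2, 110)
outputs = bert(batch_input['input_ids'], attention_mask=batch_input['attention_mask'])['pooler_output']  # shape (2, 768)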

but I'm having trouble building a model that does this. Here is the code for building such a model (the problem is in the first 4 lines):

def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  encoder_inputs = tokenizer(text_input, return_tensors='tf', add_special_tokens=True, max_length=110, padding='max_length', truncation=True)
  outputs = bert(encoder_inputs)
  net = outputs['pooler_output']
  
  X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(net)
  X = tf.keras.layers.Concatenate(axis=-1)([X, input_layer])
  X = tf.keras.layers.MaxPooling1D(20)(X)
  X = tf.keras.layers.SpatialDropout1D(0.4)(X)
  X = tf.keras.layers.Flatten()(X)
  X = tf.keras.layers.Dense(128, activation="relu")(X)
  X = tf.keras.layers.Dropout(0.25)(X)
  X = tf.keras.layers.Dense(2, activation='softmax')(X)

  model = tf.keras.Model(inputs=text_input, outputs = X) 
  
  return model

And when I call the function for creating this model I get this error:

text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples).
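If I print the type of text_input right before the tokenizer call, it is a symbolic Keras tensor rather than a string, which is presumably what the tokenizer is complaining about (a quick check, not part of the model code):

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
print(type(text_input))  # <class 'keras.engine.keras_tensor.KerasTensor'>, not a Python str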

One thing I thought was that maybe I had to use the tokenizer.batch_encode_plus function which works with lists of strings:

class BertPreprocessingLayer(tf.keras.layers.Layer):
  def __init__(self, tokenizer, maxlength):
    super().__init__()
    self._tokenizer = tokenizer
    self._maxlength = maxlength
  
  def call(self, inputs):
    print(type(inputs))
    print(inputs)
    # use the tokenizer stored on the layer
    tokenized = self._tokenizer.batch_encode_plus(inputs, add_special_tokens=True, return_tensors='tf', max_length=self._maxlength, padding='max_length', truncation=True)
    return tokenized

def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  encoder_inputs = BertPreprocessingLayer(tokenizer, 100)(text_input)
  outputs = bert(encoder_inputs)
  net = outputs['pooler_output']
  # ... same as above

but I get this error:

batch_text_or_text_pairs has to be a list (got <class 'keras.engine.keras_tensor.KerasTensor'>)

and besides the fact that I haven't found a way to convert that tensor to a list with a quick Google search, it seems weird that I have to go in and out of TensorFlow in this way.

I've also looked at HuggingFace's documentation, but there is only a single usage example, with a single phrase, and what they do is analogous to my "out of model-building context" example.

EDIT:

I also tried with a Lambda layer in this way:

tf.executing_eagerly()  # note: this only reports whether eager execution is enabled; it does not enable it

def tokenize_tensor(tensor):
  t = tensor.numpy()
  t = np.array([str(s, 'utf-8') for s in t])
  return tokenizer(t.tolist(), return_tensors='tf', add_special_tokens=True, max_length=110, padding='max_length', truncation=True)

def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(1,), dtype=tf.string, name='text')
  
  encoder_inputs = tf.keras.layers.Lambda(tokenize_tensor, name='tokenize')(text_input)
  ...
  
  outputs = bert(encoder_inputs)

but I get the following error:

'Tensor' object has no attribute 'numpy'

EDIT 2:

I also tried the approach suggested by @mdaoust of wrapping everything in a tf.py_function. The wrapper ended up looking like this:

def py_func_tokenize_tensor(tensor):
  return tf.py_function(tokenize_tensor, [tensor], Tout=[tf.int32, tf.int32, tf.int32])

My first attempt, which did not pass the Tout argument, failed with:

eager_py_func() missing 1 required positional argument: 'Tout'

Then I defined Tout as the type of the value returned by the tokenizer:

transformers.tokenization_utils_base.BatchEncoding

and got the following error:

Expected DataType for argument 'Tout' not <class 'transformers.tokenization_utils_base.BatchEncoding'>

Finally, I unpacked the values in the BatchEncoding (matching the Tout=[tf.int32, tf.int32, tf.int32] in the wrapper above) in the following way:

def tokenize_tensor(tensor):
  t = tensor.numpy()
  t = np.array([str(s, 'utf-8') for s in t])
  dictionary = tokenizer(t.tolist(), return_tensors='tf', add_special_tokens=True, max_length=110, padding='max_length', truncation=True)
  #unpacking
  input_ids = dictionary['input_ids']
  tok_type = dictionary['token_type_ids']
  attention_mask = dictionary['attention_mask']
  return input_ids, tok_type, attention_mask

And I get an error on the line below:

...
outputs = bert(encoder_inputs)

ValueError: Cannot take the length of shape with unknown rank.
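I suspect this happens because tensors returned by tf.py_function have no statically known shape, so the BERT layer cannot infer their rank. A minimal sketch of the usual workaround (setting the shapes explicitly after the py_function call, using the fixed max_length of 110 from above) would be something like the following, although I haven't verified that it gets the BERT layer past the error:

def py_func_tokenize_tensor(tensor):
  input_ids, tok_type, attention_mask = tf.py_function(tokenize_tensor, [tensor], Tout=[tf.int32, tf.int32, tf.int32])
  # py_function outputs have unknown rank, so declare the (batch, sequence_length) shape explicitly
  input_ids.set_shape([None, 110])
  tok_type.set_shape([None, 110])
  attention_mask.set_shape([None, 110])
  return input_ids, tok_type, attention_mask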


Solution

  • For now I solved it by taking the tokenization step out of the model:

    def tokenize(sentences, tokenizer):
        input_ids, input_masks, input_segments = [],[],[]
        for sentence in sentences:
            inputs = tokenizer.encode_plus(sentence, add_special_tokens=True, max_length=128, pad_to_max_length=True, return_attention_mask=True, return_token_type_ids=True)
            input_ids.append(inputs['input_ids'])
            input_masks.append(inputs['attention_mask'])
            input_segments.append(inputs['token_type_ids'])        
            
        return np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')
    

    The model takes two inputs, which are the first two values returned by the tokenize function.

    def build_classifier_model():
       input_ids_in = tf.keras.layers.Input(shape=(128,), name='input_token', dtype='int32')
       input_masks_in = tf.keras.layers.Input(shape=(128,), name='masked_token', dtype='int32') 
    
       embedding_layer = bert(input_ids_in, attention_mask=input_masks_in)[0]
    ...
       model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs = X)
    
       for layer in model.layers[:3]:
         layer.trainable = False
       return model
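    Training then looks roughly like this (a sketch: train_sentences and train_labels stand in for the actual tweet dataset, and the compile settings are placeholders):

    train_ids, train_masks, _ = tokenize(train_sentences, tokenizer)
    model = build_classifier_model()
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.fit([train_ids, train_masks], train_labels, epochs=3, batch_size=32)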
    

    I'd still like to know if someone has a solution that integrates the tokenization step inside the model-building context, so that a user of the model can simply feed phrases to it to get a prediction or to train the model.
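    In the meantime, a thin wrapper around the two steps gives users something close to that convenience (a sketch based on the tokenize function above; predict_irony is just a made-up helper name):

    def predict_irony(sentences, model, tokenizer):
        # tokenize outside the model, then feed the two inputs it expects
        ids, masks, _ = tokenize(sentences, tokenizer)
        probs = model.predict([ids, masks])
        return probs.argmax(axis=-1)  # index of the predicted class for each sentence

    predict_irony(["Ciao, come va?"], model, tokenizer)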