I am trying to solve a multilabel text classification problem using BERT from the Hugging Face transformers library. The model is defined as follows:
import tensorflow as tf

def create_model(encoder, nb_classes=3, lr=1e-5):
    # inputs
    input_ids = tf.keras.Input(shape=(512,), ragged=False,
                               dtype=tf.int32, name='input_ids')
    input_attention_mask = tf.keras.Input(shape=(512,), ragged=False,
                                          dtype=tf.int32, name='attention_mask')
    # transformer
    output = encoder({'input_ids': input_ids,
                      'attention_mask': input_attention_mask})[0]
    Y = tf.keras.layers.BatchNormalization()(output)
    Y = tf.keras.layers.Dense(nb_classes, activation='sigmoid')(Y)
    # compilation
    model = tf.keras.Model(inputs=[input_ids, input_attention_mask],
                           outputs=[Y])
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    # losses
    # loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    # loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)
    model.compile(optimizer=optimizer,
                  loss=multilabel_loss, metrics=['acc'])
    return model
As you can see, I tried to use the built-in tf.keras.losses, but they did not work (throwing AttributeError: 'Tensor' object has no attribute 'nested_row_splits'), so I defined a simple cross-entropy by hand:
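def multilabel_loss(y_true, y_pred):
    y_pred = tf.convert_to_tensor(y_pred)
    y_true = tf.cast(y_true, y_pred.dtype)
    # Sum of the per-label binary cross-entropy terms.
    # The trailing argument of this call was cut off in the post;
    # axis=-1 (sum over the label axis) is an assumption.
    cross_entropy = -tf.reduce_sum(
        y_true * tf.math.log(y_pred + 1e-8)
        + (1 - y_true) * tf.math.log(1 - y_pred + 1e-8),
        axis=-1)
    return cross_entropy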
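As a sanity check, the same quantity can be computed by hand for a single example. This is a minimal pure-Python sketch, assuming the sum runs over all labels of one example; the helper name and the sample probabilities are hypothetical and not part of the training code:

```python
import math

def multilabel_cross_entropy(y_true, y_pred, eps=1e-8):
    """Sum of per-label binary cross-entropy terms for one example."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        # Each label contributes its own independent binary term.
        total += t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return -total

# One example with the label [0, 0, 1] and made-up sigmoid outputs.
loss = multilabel_cross_entropy([0, 0, 1], [0.1, 0.2, 0.9])
print(round(loss, 4))
```

A confident, correct prediction drives every term toward zero, while a confident wrong prediction on any single label makes the sum large.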
The model is created inside strategy.scope() as shown below, using 'distil-bert-uncased' as the checkpoint:
with strategy.scope():
    encoder = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
    #encoder = TFRobertaForSequenceClassification.from_pretrained(checkpoint)
    model = create_model(encoder)
The labels are binary arrays:
163350 [0, 0, 1]
118940 [0, 0, 1]
65243 [0, 0, 1]
30011 [0, 0, 1]
189713 [0, 1, 0]
They are combined with the tokenized texts into a tf.data.Dataset by the following function:
def tf_text_data_prep(df):
    """
    input: takes pandas dataframe
    output: returns tokenized tf.Dataset
    """
    hugging_ds = Dataset.from_pandas(df)
    tokenized_ds = hugging_ds.map(
        # (tokenization function argument not shown)
        remove_columns=["Text", '__index_level_0__'],
    )
    # Convert to tensorflow
    tf_dataset = tokenized_ds.with_format("tensorflow")
    features = {x: tf_dataset[x].to_tensor() for x in tokenizer.model_input_names}
    tf_data = tf.data.Dataset.from_tensor_slices((features, tf_dataset["label"]))
    return tf_data
The problem is that when I launch the training, I get this error:
TypeError Traceback (most recent call last)
<ipython-input-62-720b4634d50e> in <module>()
----> 1 get_ipython().run_cell_magic('time', '', 'steps_per_epoch = int(BUFFER_SIZE // BATCH_SIZE)\nprint(\n f"Model Params:\\nbatch_size: {BATCH_SIZE}\\nEpochs: {EPOCHS}\\n"\n f"Step p. Epoch: {steps_per_epoch}\\n"\n f"Initial Learning rate: {INITAL_LEARNING_RATE}"\n)\nhistory = model.fit(\n train_ds,\n validation_data=val_ds,\n batch_size=BATCH_SIZE,\n epochs=EPOCHS,\n callbacks=callbacks,\n verbose=1,\n)')
12 frames
raise TypeError('One of the inputs does not have acceptable types.')
TypeError: One of the inputs does not have acceptable types.
This same approach worked for ordinary binary classification, but not for multilabel. I'd appreciate any help regarding the error or the approach in general.
The issue is that you are using TFAutoModelForSequenceClassification, i.e. a ForSequenceClassification model. If you look at its summary, you will see that it ends in a Dense classifier layer and returns that layer's output, so it is not the bare encoder you want.
encoder = TFAutoModelForSequenceClassification.from_pretrained('bert-base-uncased')
All model checkpoint layers were used when initializing TFBertForSequenceClassification.
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model: "tf_bert_for_sequence_classification_2"
Layer (type) Output Shape Param #
bert (TFBertMainLayer) multiple 109482240
dropout_187 (Dropout) multiple 0
classifier (Dense) multiple 1538
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
But you want to use it as an encoder, so you have to load the base model instead:
from transformers import TFBertModel
encoder = TFBertModel.from_pretrained('bert-base-uncased')
model = create_model(encoder)
Model: "tf_bert_model_2"
Layer (type) Output Shape Param #
bert (TFBertMainLayer) multiple 109482240
Total params: 109,482,240
Trainable params: 109,482,240
Non-trainable params: 0
As you can see, the model now consists of the BERT main layer only, so it really is an encoder. Passing it to create_model now makes sense, but it will still raise an error because of these lines in create_model:
output = encoder({'input_ids': input_ids,
'attention_mask': input_attention_mask})[0]
This is because the output at index 0 has shape (batch_size, token_length, embedding), whereas we want the pooled representation of the [CLS] token, which has shape (batch_size, embedding). That is the output at index 1, so the lines have to be changed to:
output = encoder({'input_ids': input_ids,
'attention_mask': input_attention_mask})[1]
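If the encoder does not expose a pooled output at index 1 (DistilBERT-style models, as far as I know, return only the per-token hidden states), a common alternative is to slice the first token's vector out of the index-0 output. The shape logic can be sketched with NumPy; the array below is a random stand-in with made-up dimensions, not the output of a real model:

```python
import numpy as np

batch_size, token_length, embedding = 2, 5, 8  # hypothetical sizes

# Stand-in for the index-0 output: one vector per token.
last_hidden_state = np.random.rand(batch_size, token_length, embedding)

# [CLS] is the first token of every sequence, so take position 0.
cls_vector = last_hidden_state[:, 0, :]

print(last_hidden_state.shape)  # (2, 5, 8)
print(cls_vector.shape)         # (2, 8)
```

The same slice works on a Keras tensor inside create_model. Note that the index-1 output of TFBertModel is this vector passed through an extra dense layer with tanh, so the two are not numerically identical.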
Also, you are currently fixing the input length to 512. Set the sequence dimension to None instead, so that the model accepts variable-length input:
input_ids = tf.keras.Input(shape=(None,), ragged=False, dtype=tf.int32, name='input_ids')
input_attention_mask = tf.keras.Input(shape=(None,), ragged=False, dtype=tf.int32, name='attention_mask')
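With shape=(None,), each batch only has to be padded to its own longest sequence rather than to 512. Here is a minimal pure-Python sketch of that padding step; the function name, the pad id of 0 and the token ids are assumptions for illustration, and in practice the tokenizer's own padding does this for you:

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Two made-up sequences of different lengths.
ids, mask = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(ids)
print(mask)
```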
After making all these changes, here is the result of a sample run:
from transformers import BertTokenizerFast, TFBertModel
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
encoder = TFBertModel.from_pretrained('bert-base-uncased')
model = create_model(encoder)
inputs = tokenizer('hello world', return_tensors='tf')
model.predict((inputs['input_ids'], inputs['attention_mask']))
array([[0.7867866 , 0.65974414, 0.45628983]], dtype=float32)
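Because each sigmoid output is an independent per-label probability, turning these scores into multilabel predictions is a thresholding step. Here is a small sketch using the scores printed above; the 0.5 cutoff is an assumption and is usually tuned on validation data:

```python
scores = [0.7867866, 0.65974414, 0.45628983]  # sample output from the run above
threshold = 0.5  # assumed cutoff, not part of the original answer

# A label is predicted as present when its score reaches the threshold.
predicted = [int(s >= threshold) for s in scores]
print(predicted)  # [1, 1, 0]
```

Since the classification head is untrained at this point, the scores themselves are not yet meaningful; the example only shows the mechanics.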