I am playing around with pretrained BERT models (bert-large-cased-whole-word-masking) using Hugging Face. I first used this piece of code:
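from transformers import TFBertLMHeadModel, BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased-whole-word-masking")
m = TFBertLMHeadModel.from_pretrained("bert-large-cased-whole-word-masking")
logits = m(tokenizer("hello world [MASK] like it", return_tensors="tf")["input_ids"]).logits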
I then applied softmax followed by argmax to get the most probable token at each position. Everything worked fine up to this point.
When I added padding with max_length=100, the model started making wrong predictions: every predicted token was the same, namely token ID 119.
The code I used for the argmax:
tf.argmax(tf.keras.activations.softmax(m(tokenizer("hello world [MASK] like it", return_tensors="tf", max_length=100, padding="max_length")["input_ids"]).logits)[0], axis=-1)
Output before using padding:
<tf.Tensor: shape=(7,), dtype=int64, numpy=array([ 9800, 19082, 1362, 146, 1176, 1122, 119])>
Output after using padding with max_length of 100:
<tf.Tensor: shape=(100,), dtype=int64, numpy=
array([119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
119, 119, 119, 119, 119, 119, 119, 119, 119])>
I wonder whether this problem persists when training a new model. Since a fixed input shape is mandatory for training, I padded and tokenized my data, and now I want to know whether the same problem will occur there too.
As already mentioned in the comments, you forgot to pass the attention_mask to BERT. You only passed the input_ids from the tokenizer output, so the model treated the added padding tokens like ordinary tokens.
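To illustrate why the missing mask matters, here is a simplified pure-Python sketch of a single attention step. The scores are made-up numbers, and `attention_weights` is a hypothetical helper, not part of transformers. Real implementations typically add a large negative value to the masked scores instead of using negative infinity, but the effect is the same: masked positions receive (almost) no weight.

```python
import math

def attention_weights(scores, mask):
    """Softmax over scores; positions where mask == 0 get zero weight."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw attention scores of one query token over 3 real tokens.
real_scores = [2.0, 1.0, 0.5]
# The same sequence followed by 4 padding positions (scores are also made up).
padded_scores = real_scores + [1.5] * 4

unpadded = attention_weights(real_scores, [1, 1, 1])
no_mask = attention_weights(padded_scores, [1] * 7)
with_mask = attention_weights(padded_scores, [1, 1, 1, 0, 0, 0, 0])

print("share of attention on padding without mask:", sum(no_mask[3:]))
print("share of attention on padding with mask:   ", sum(with_mask[3:]))
```

Without the mask, a large share of the attention goes to the padding positions. With the mask, the weights over the real tokens match the unpadded case, which is why the predictions stay the same regardless of padding length.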
You also asked in the comments how you can get rid of the predictions for the padding tokens. There are several ways to do this, depending on your actual task. One of them is to remove them with tf.boolean_mask and the attention_mask, as shown below:
import tensorflow as tf
from transformers import TFBertLMHeadModel, BertTokenizerFast
ckpt = "bert-large-cased-whole-word-masking"
t = BertTokenizerFast.from_pretrained(ckpt)
m = TFBertLMHeadModel.from_pretrained(ckpt)
e = t("hello world [MASK] like it",return_tensors="tf")
e_padded = t("hello world [MASK] like it",return_tensors="tf", padding="max_length", max_length = 100)
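def prediction(encoding):
    # Unpacking the encoding passes input_ids, token_type_ids and attention_mask.
    logits = m(**encoding).logits
    token_mapping = tf.argmax(tf.keras.activations.softmax(logits), axis=-1)
    # Keep only the predictions at positions where attention_mask == 1.
    return tf.boolean_mask(token_mapping, encoding["attention_mask"])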
token_predictions = prediction(e)
token_predictions_padded = prediction(e_padded)
print(token_predictions)
print(token_predictions_padded)
Output:
tf.Tensor([ 9800 19082 1362 146 1176 1122 119], shape=(7,), dtype=int64)
tf.Tensor([ 9800 19082 1362 146 1176 1122 119], shape=(7,), dtype=int64)
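If your task only needs the prediction for the [MASK] position, another option is to select that position directly. The following is a minimal sketch in plain Python lists rather than tensors. It assumes the [MASK] token ID is 103 and the [PAD] ID is 0 (check `t.mask_token_id` and `t.pad_token_id` for your tokenizer). The input IDs are illustrative, and the predictions at the padding positions are placeholders.

```python
# Assumed special-token IDs; verify them with t.mask_token_id / t.pad_token_id.
MASK_ID = 103
PAD_ID = 0

# Illustrative input IDs for "[CLS] hello world [MASK] like it [SEP]" plus 3 pads.
input_ids = [101, 19082, 1362, MASK_ID, 1176, 1122, 102] + [PAD_ID] * 3
# Predicted IDs from the output above; the last 3 values are placeholders.
predicted_ids = [9800, 19082, 1362, 146, 1176, 1122, 119] + [119] * 3

# Find every [MASK] position and keep only the predictions made there.
mask_positions = [i for i, tok in enumerate(input_ids) if tok == MASK_ID]
mask_predictions = [predicted_ids[i] for i in mask_positions]

print(mask_positions)    # [3]
print(mask_predictions)  # [146]
```

The resulting IDs can then be converted back to text with the tokenizer's `decode` method.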