I'm attempting to fine-tune the HuggingFace TFBertModel to be able to classify some text to a single label. I have the model up and running, however the accuracy is extremely low from the start. My expectation is that the accuracy would be high given that it is using the BERT pre-trained weights as a starting point. I was hoping to get some advice on where I'm going wrong.
I'm using the bbc-text dataset from here:
Load Data
import pandas as pd
import numpy as np

df = pd.read_csv(open(<s3 url>), encoding='utf-8', error_bad_lines=False)
df = df.sample(frac=1)
df = df.dropna(how='any')
Value Counts
sport 511
business 510
politics 417
tech 401
entertainment 386
Name: label, dtype: int64
Preprocessing
import re
from html import unescape

def preprocess_text(sen):
    # Convert html entities to normal
    sentence = unescape(sen)
    # Remove html tags
    sentence = remove_tags(sentence)
    # Remove newline chars
    sentence = remove_newlinechars(sentence)
    # Remove punctuation and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    # Convert to lowercase
    sentence = sentence.lower()
    return sentence

def remove_newlinechars(text):
    return " ".join(text.splitlines())

def remove_tags(text):
    TAG_RE = re.compile(r'<[^>]+>')
    return TAG_RE.sub('', text)
df['text_prepd'] = df['text'].apply(preprocess_text)
Split Data
from sklearn.model_selection import train_test_split

train, val = train_test_split(df, test_size=0.30, shuffle=True, stratify=df['label'])
Encode Labels
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y_train = np.asarray(label_encoder.fit_transform(train['label']))
y_val = np.asarray(label_encoder.fit_transform(val['label']))
Define BERT input function
import tensorflow as tf
from tqdm import tqdm
from transformers import BertTokenizer

# Initialise the BERT tokenizer
bert_tokenizer_transformer = BertTokenizer.from_pretrained('bert-base-cased')
def create_input_array(df, tokenizer, args):
    sentences = df.text_prepd.values
    input_ids = []
    attention_masks = []
    token_type_ids = []
    for sent in tqdm(sentences):
        # `encode_plus` will:
        #   (1) Tokenize the sentence.
        #   (2) Prepend the `[CLS]` token to the start.
        #   (3) Append the `[SEP]` token to the end.
        #   (4) Map tokens to their IDs.
        #   (5) Pad or truncate the sentence to `max_length`.
        #   (6) Create attention masks for [PAD] tokens.
        encoded_dict = tokenizer.encode_plus(
            sent,                         # Sentence to encode.
            add_special_tokens=True,      # Add '[CLS]' and '[SEP]'.
            max_length=args.max_seq_len,  # Pad & truncate all sentences.
            pad_to_max_length=True,
            return_attention_mask=True,   # Construct attention masks.
            return_tensors='tf',          # Return tf tensors.
        )
        # Add the encoded sentence to the list.
        input_ids.append(encoded_dict['input_ids'])
        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks.append(encoded_dict['attention_mask'])
        token_type_ids.append(encoded_dict['token_type_ids'])
    input_ids = tf.convert_to_tensor(input_ids)
    attention_masks = tf.convert_to_tensor(attention_masks)
    token_type_ids = tf.convert_to_tensor(token_type_ids)
    return input_ids, attention_masks, token_type_ids
Convert Data to Bert Inputs
train_inputs = [create_input_array(train[:], tokenizer=bert_tokenizer_transformer, args=args)]
val_inputs = [create_input_array(val[:], tokenizer=bert_tokenizer_transformer, args=args)]
To train_inputs, y_train and val_inputs, y_val I then apply the function below, which reshapes everything and converts it to numpy arrays. The list returned from this function is then passed as arguments to the Keras fit method. I realise converting to tf.Tensors and then back to numpy is a bit of overkill, but I don't think it has an impact. (I was originally trying to use tf.data.Dataset but switched to numpy.)
def convert_inputs_to_tf_dataset(inputs, y, args):
    # args.max_seq_len = 256
    ids = inputs[0][1]
    masks = inputs[0][1]
    token_types = inputs[0][2]
    ids = tf.reshape(ids, (-1, args.max_seq_len))
    print("Input ids shape: ", ids.shape)
    masks = tf.reshape(masks, (-1, args.max_seq_len))
    print("Input Masks shape: ", masks.shape)
    token_types = tf.reshape(token_types, (-1, args.max_seq_len))
    print("Token type ids shape: ", token_types.shape)
    ids = ids.numpy()
    masks = masks.numpy()
    token_types = token_types.numpy()
    return [ids, masks, token_types, y]
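I call it like this (the exact variable names in my actual code may differ slightly):

train_x = convert_inputs_to_tf_dataset(train_inputs, y_train, args)
val_x = convert_inputs_to_tf_dataset(val_inputs, y_val, args)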
Keras Model
from tensorflow.keras.layers import Input, Flatten, Dropout, Dense
from tensorflow.keras.models import Model
from transformers import TFBertForSequenceClassification

# args.max_seq_len = 256
# n_classes = 6
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', trainable=True, num_labels=n_classes)

input_ids_layer = Input(shape=(args.max_seq_len,), dtype=np.int32)
input_mask_layer = Input(shape=(args.max_seq_len,), dtype=np.int32)
input_token_type_layer = Input(shape=(args.max_seq_len,), dtype=np.int32)

bert_layer = model([input_ids_layer, input_mask_layer, input_token_type_layer])[0]
flat_layer = Flatten()(bert_layer)
dropout = Dropout(0.3)(flat_layer)
dense_output = Dense(n_classes, activation='softmax')(dropout)

model_ = Model(inputs=[input_ids_layer, input_mask_layer, input_token_type_layer], outputs=dense_output)
Compile and Fit
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer='adam', loss=loss, metrics=[metric])
model.fit(inputs=..., outputs=..., validation_data=..., epochs=50, batch_size = 32, metrics=metric, verbose=1)
Epoch 32/50
1401/1401 [==============================] - 42s 30ms/sample - loss: 1.6103 - accuracy: 0.2327 - val_loss: 1.6042 - val_accuracy: 0.2308
As I'm using BERT, only a few epochs should be needed, so I was expecting something much higher than 23% accuracy after 32 epochs.
The main problem is in this line: ids = inputs[0][1]. Actually, the ids are the first element of inputs[0], so it should be ids = inputs[0][0].
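A minimal sketch of the fix, keeping your variable names:

# inputs[0] is the tuple (input_ids, attention_masks, token_type_ids)
# returned by create_input_array, so the ids live at index 0.
ids = inputs[0][0]    # was inputs[0][1], which picked up the attention masks instead
masks = inputs[0][1]
token_types = inputs[0][2]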
But there is also another problem which might result in inconsistent validation accuracy: you should fit the LabelEncoder only once to construct the label mapping, so you should use the transform method, instead of fit_transform, on the validation labels.
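For example, using the names from your code:

label_encoder = LabelEncoder()
# Fit once on the training labels to build the label-to-integer mapping.
y_train = np.asarray(label_encoder.fit_transform(train['label']))
# Reuse the same mapping for validation: transform() only, no refitting.
y_val = np.asarray(label_encoder.transform(val['label']))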
Further, don't use both a softmax activation and from_logits=True in the loss function simultaneously; use only one of them (see here for more info).
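So either of these is consistent (sketches based on your model definition):

# Option 1: no activation on the final layer; the loss expects raw logits.
dense_output = Dense(n_classes)(dropout)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Option 2: softmax on the final layer; the loss expects probabilities.
dense_output = Dense(n_classes, activation='softmax')(dropout)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)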
Another point is that you might need to use a lower learning rate for the optimizer. The default learning rate of the Adam optimizer is 1e-3, which is probably too high for fine-tuning a pretrained model. Try a lower learning rate, say 1e-4 or 1e-5, e.g. tf.keras.optimizers.Adam(learning_rate=1e-4). A high learning rate can destroy the pretrained weights and disrupt the fine-tuning process (due to the large gradient values generated, especially at the start of fine-tuning).
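For example:

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
# Compile the functional model defined in the question.
model_.compile(optimizer=optimizer, loss=loss, metrics=[metric])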