I'm attempting to fine-tune the HuggingFace TFBertModel to be able to classify some text to a single label. I have the model up and running, however the accuracy is extremely low from the start. My expectation is that the accuracy would be high given that it is using the BERT pre-trained weights as a starting point. I was hoping to get some advice on where I'm going wrong.
I'm using the bbc-text dataset from here:
Load Data
import pandas as pd
import numpy as np

df = pd.read_csv(open(<s3 url>), encoding='utf-8', error_bad_lines=False)
df = df.sample(frac=1)
df = df.dropna(how='any')
Value Counts
sport 511
business 510
politics 417
tech 401
entertainment 386
Name: label, dtype: int64
Preprocessing
import re
from html import unescape

def preprocess_text(sen):
    # Convert html entities to normal
    sentence = unescape(sen)
    # Remove html tags
    sentence = remove_tags(sentence)
    # Remove newline chars
    sentence = remove_newlinechars(sentence)
    # Remove punctuation and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    # Convert to lowercase
    sentence = sentence.lower()
    return sentence

def remove_newlinechars(text):
    return " ".join(text.splitlines())

def remove_tags(text):
    TAG_RE = re.compile(r'<[^>]+>')
    return TAG_RE.sub('', text)
df['text_prepd'] = df['text'].apply(preprocess_text)
Split Data
from sklearn.model_selection import train_test_split

train, val = train_test_split(df, test_size=0.30, shuffle=True, stratify=df['label'])
Encode Labels
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y_train = np.asarray(label_encoder.fit_transform(train['label']))
y_val = np.asarray(label_encoder.fit_transform(val['label']))
Define BERT input function
import tensorflow as tf
from tqdm import tqdm
from transformers import BertTokenizer

# Initialise the BERT tokenizer
bert_tokenizer_transformer = BertTokenizer.from_pretrained('bert-base-cased')
def create_input_array(df, tokenizer, args):
    sentences = df.text_prepd.values
    input_ids = []
    attention_masks = []
    token_type_ids = []
    for sent in tqdm(sentences):
        # `encode_plus` will:
        #   (1) Tokenize the sentence.
        #   (2) Prepend the `[CLS]` token to the start.
        #   (3) Append the `[SEP]` token to the end.
        #   (4) Map tokens to their IDs.
        #   (5) Pad or truncate the sentence to `max_length`.
        #   (6) Create attention masks for [PAD] tokens.
        encoded_dict = tokenizer.encode_plus(
            sent,                         # Sentence to encode.
            add_special_tokens=True,      # Add '[CLS]' and '[SEP]'.
            max_length=args.max_seq_len,  # Pad & truncate all sentences.
            pad_to_max_length=True,
            return_attention_mask=True,   # Construct attention masks.
            return_tensors='tf',          # Return tf tensors.
        )
        # Add the encoded sentence to the list.
        input_ids.append(encoded_dict['input_ids'])
        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks.append(encoded_dict['attention_mask'])
        token_type_ids.append(encoded_dict['token_type_ids'])
    input_ids = tf.convert_to_tensor(input_ids)
    attention_masks = tf.convert_to_tensor(attention_masks)
    token_type_ids = tf.convert_to_tensor(token_type_ids)
    return input_ids, attention_masks, token_type_ids
Convert Data to Bert Inputs
train_inputs = [create_input_array(train[:], tokenizer=bert_tokenizer_transformer, args=args)]
val_inputs = [create_input_array(val[:], tokenizer=bert_tokenizer_transformer, args=args)]
To train_inputs, y_train and val_inputs, y_val I then apply the function below, which reshapes everything and converts it to numpy arrays. The list returned from this function is then passed as arguments to the Keras fit method. I realise converting to tf.Tensors and then back to numpy is a bit of overkill, but I don't think it has an impact. (I was originally trying to use tf.data.Dataset but switched to numpy.)
def convert_inputs_to_tf_dataset(inputs, y, args):
    # args.max_seq_len = 256
    ids = inputs[0][1]
    masks = inputs[0][1]
    token_types = inputs[0][2]
    ids = tf.reshape(ids, (-1, args.max_seq_len))
    print("Input ids shape: ", ids.shape)
    masks = tf.reshape(masks, (-1, args.max_seq_len))
    print("Input Masks shape: ", masks.shape)
    token_types = tf.reshape(token_types, (-1, args.max_seq_len))
    print("Token type ids shape: ", token_types.shape)
    ids = ids.numpy()
    masks = masks.numpy()
    token_types = token_types.numpy()
    return [ids, masks, token_types, y]
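I call it like this (the exact variable names in my actual code may differ slightly):

train_x = convert_inputs_to_tf_dataset(train_inputs, y_train, args)
val_x = convert_inputs_to_tf_dataset(val_inputs, y_val, args)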
Keras Model
from tensorflow.keras.layers import Input, Flatten, Dropout, Dense
from tensorflow.keras.models import Model
from transformers import TFBertForSequenceClassification

# args.max_seq_len = 256
# n_classes = 6
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', trainable=True, num_labels=n_classes)

input_ids_layer = Input(shape=(args.max_seq_len,), dtype=np.int32)
input_mask_layer = Input(shape=(args.max_seq_len,), dtype=np.int32)
input_token_type_layer = Input(shape=(args.max_seq_len,), dtype=np.int32)

bert_layer = model([input_ids_layer, input_mask_layer, input_token_type_layer])[0]
flat_layer = Flatten()(bert_layer)
dropout = Dropout(0.3)(flat_layer)
dense_output = Dense(n_classes, activation='softmax')(dropout)

model_ = Model(inputs=[input_ids_layer, input_mask_layer, input_token_type_layer], outputs=dense_output)
Compile and Fit
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer='adam', loss=loss, metrics=[metric])
model.fit(inputs=..., outputs=..., validation_data=..., epochs=50, batch_size = 32, metrics=metric, verbose=1)
Epoch 32/50
1401/1401 [==============================] - 42s 30ms/sample - loss: 1.6103 - accuracy: 0.2327 - val_loss: 1.6042 - val_accuracy: 0.2308
As I'm using BERT, only a few epochs should be needed, so I was expecting something much higher than 23% accuracy after 32 epochs.
The main problem is in this line: ids = inputs[0][1]. Actually, the ids are the first element of inputs[0], so it should be ids = inputs[0][0].
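A minimal sketch of the fix, keeping your variable names:

# inputs[0] is the tuple (input_ids, attention_masks, token_type_ids)
# returned by create_input_array, so the ids live at index 0.
ids = inputs[0][0]    # was inputs[0][1], which picked up the attention masks instead
masks = inputs[0][1]
token_types = inputs[0][2]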
But there is also another problem which might result in inconsistent validation accuracy: you should fit the LabelEncoder only once to construct the label mapping, so you should use the transform method, instead of fit_transform, on the validation labels.
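For example, using the names from your code:

label_encoder = LabelEncoder()
# Fit once on the training labels to build the label-to-integer mapping.
y_train = np.asarray(label_encoder.fit_transform(train['label']))
# Reuse the same mapping for validation: transform() only, no refitting.
y_val = np.asarray(label_encoder.transform(val['label']))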
Further, don't use both a softmax activation and from_logits=True in the loss function simultaneously; use only one of them (see here for more info).
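So either of these is consistent (sketches based on your model definition):

# Option 1: no activation on the final layer; the loss expects raw logits.
dense_output = Dense(n_classes)(dropout)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Option 2: softmax on the final layer; the loss expects probabilities.
dense_output = Dense(n_classes, activation='softmax')(dropout)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)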
Another point is that you might need to use a lower learning rate for the optimizer. The default learning rate of the Adam optimizer is 1e-3, which is probably too high for fine-tuning a pretrained model. Try a lower learning rate, say 1e-4 or 1e-5, e.g. tf.keras.optimizers.Adam(learning_rate=1e-4). A high learning rate can destroy the pretrained weights and disrupt the fine-tuning process (due to the large gradient values generated, especially at the start of fine-tuning).
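For example:

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
# Compile the functional model defined in the question.
model_.compile(optimizer=optimizer, loss=loss, metrics=[metric])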