Tags: deep-learning, nlp, bert-language-model, named-entity-recognition

Training CamelBERT model for token classification


I am trying to use a Hugging Face model (CamelBERT) for token classification on the ANERCorp dataset. I fed the ANERCorp training set to the model, but I am getting the following error.

Error:

Some weights of the model checkpoint at CAMeL-Lab/bert-base-arabic-camelbert-ca were not used when initializing BertForTokenClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at CAMeL-Lab/bert-base-arabic-camelbert-ca and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
03/16/2022 07:31:01 - INFO - utils -   Creating features from dataset file at /content/drive/MyDrive/ANERcorp-CamelLabSplits
03/16/2022 07:31:01 - INFO - utils -   Writing example 0 of 3973
Traceback (most recent call last):
  File "/content/CAMeLBERT/token-classification/run_token_classification.py", line 381, in <module>
    main()
  File "/content/CAMeLBERT/token-classification/run_token_classification.py", line 226, in main
    if training_args.do_train
  File "/content/CAMeLBERT/token-classification/utils.py", line 132, in __init__
    pad_token_label_id=self.pad_token_label_id,
  File "/content/CAMeLBERT/token-classification/utils.py", line 210, in convert_examples_to_features
    label_ids.extend([label_map[label]] +
KeyError: 'B-LOC'

Please note that I am using Google Colab to train the model. Here is my code:

DATA_DIR="/content/drive/MyDrive/ANERcorp-CamelLabSplits"
MAX_LENGTH=512
BERT_MODEL="CAMeL-Lab/bert-base-arabic-camelbert-ca"
OUTPUT_DIR="/content/Output"
BATCH_SIZE=32
NUM_EPOCHS=3
SAVE_STEPS=750
SEED=12345

!python /content/CAMeLBERT/token-classification/run_token_classification.py \
--data_dir $DATA_DIR \
--task_type ner \
--labels $DATA_DIR/train.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_device_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--overwrite_output_dir \
--overwrite_cache \
--do_train \
--do_predict

Solution

The script you are using loads the label list from $DATA_DIR/train.txt, the file your command passes via --labels.

See https://github.com/CAMeL-Lab/CAMeLBERT/blob/master/token-classification/run_token_classification.py#L105 for what the model expects.

It then tries to build the label list from that file (even before loading the training data, see https://github.com/CAMeL-Lab/CAMeLBERT/blob/master/token-classification/run_token_classification.py#L183) and puts it into label_map.
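To make the error concrete: a KeyError like this means the tag read from the corpus was never put into label_map. Here is a hypothetical sketch of the usual pattern (not the actual CAMeLBERT source):

# Hypothetical sketch: every label read from the --labels file gets an integer id.
labels = []  # whatever the label loading returned; empty if the file was misread
label_map = {label: i for i, label in enumerate(labels)}

# During feature conversion, each corpus tag is looked up in the map:
label_map["B-LOC"]  # raises KeyError: 'B-LOC' unless that exact string is a key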

But that fails for some reason. My assumption is that it doesn't find anything, so label_map ends up as an empty dict and the first attempt to look up a label in it fails with a KeyError. Probably your input data is either missing or not at the path the script expects (check that you have the right files and the right value for $DATA_DIR). In my experience, relative paths on Google Drive can be tricky. Try something simple first, like os.listdir(DATA_DIR), to confirm it is actually the directory you expect; a quick sanity check along those lines is sketched below.
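For example, a minimal check you could run in a Colab cell before training. It assumes the usual CoNLL-style two-column format (one token/tag pair per line, blank lines between sentences), so adjust if ANERCorp's files differ:

import os

DATA_DIR = "/content/drive/MyDrive/ANERcorp-CamelLabSplits"

# 1) Is the drive mounted where we think, and does it contain train.txt?
print(os.listdir(DATA_DIR))

# 2) Which distinct tags does train.txt actually contain?
#    Assumes CoNLL-style "token tag" lines with blank lines between sentences.
tags = set()
with open(os.path.join(DATA_DIR, "train.txt"), encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            tags.add(line.split()[-1])
print(sorted(tags))  # expect something like ['B-LOC', 'B-MISC', ..., 'O']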

If that is not the problem, then something about the labels themselves is probably wrong. Does ANERCorp write its labels in exactly this way (B-LOC etc.)? If it spells them differently (e.g. B-Location or similar), the lookup would fail too.
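If a spelling mismatch turns out to be the cause, one option is to rewrite the tags to the expected form before training. A minimal sketch, assuming the two-column format above; the alternative spellings in TAG_FIXES are purely hypothetical:

# Hypothetical alternative spellings mapped to the B-LOC/I-LOC style the script expects.
TAG_FIXES = {"B-Location": "B-LOC", "I-Location": "I-LOC"}

def normalize_line(line):
    line = line.rstrip("\n")
    if not line:
        return line  # keep the blank lines that separate sentences
    token, tag = line.rsplit(" ", 1)
    return f"{token} {TAG_FIXES.get(tag, tag)}"

with open("train.txt", encoding="utf-8") as src, \
     open("train_fixed.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(normalize_line(line) + "\n")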