Tags: bert-language-model, allennlp, roberta-language-model, srl

How to change AllenNLP's BERT-based Semantic Role Labeling model to RoBERTa


Currently I'm able to train a Semantic Role Labeling model using the config file below. This config file is based on the one provided by AllenNLP and works for the default bert-base-uncased model and also for GroNLP/bert-base-dutch-cased.

{
  "dataset_reader": {
    "type": "srl_custom",
    "bert_model_name": "GroNLP/bert-base-dutch-cased"
  },
  "data_loader": {
    "batch_sampler": {
      "type": "bucket",
      "batch_size": 32
    }
  },
  "train_data_path": "./data/SRL/SONAR_1_SRL/MANUAL500/",
  "validation_data_path": "./data/SRL/SONAR_1_SRL/MANUAL500/",
  "model": {
    "type": "srl_bert",
    "embedding_dropout": 0.1,
    "bert_model": "GroNLP/bert-base-dutch-cased"
  },
  "trainer": {
    "optimizer": {
      "type": "huggingface_adamw",
      "lr": 5e-5,
      "correct_bias": false,
      "weight_decay": 0.01,
      "parameter_groups": [
        [
          [
            "bias",
            "LayerNorm.bias",
            "LayerNorm.weight",
            "layer_norm.weight"
          ],
          {
            "weight_decay": 0.0
          }
        ]
      ]
    },
    "learning_rate_scheduler": {
      "type": "slanted_triangular"
    },
    "checkpointer": {
      "keep_most_recent_by_count": 2
    },
    "grad_norm": 1.0,
    "num_epochs": 3,
    "validation_metric": "+f1-measure-overall"
  }
}

Swapping the values of the bert_model_name and bert_model parameters from GroNLP/bert-base-dutch-cased to roberta-base doesn't work out of the box, since the SRL dataset reader only supports the BertTokenizer and not the RobertaTokenizer. So I changed the config file to the following:

{
  "dataset_reader": {
    "type": "srl_custom",
    "token_indexers": {
      "tokens": {
        "type": "pretrained_transformer",
        "model_name": "roberta-base"
      }
    }
  },
  "data_loader": {
    "batch_sampler": {
      "type": "bucket",
      "batch_size": 32
    }
  },
  "train_data_path": "./data/SRL/SONAR_1_SRL/MANUAL500/",
  "validation_data_path": "./data/SRL/SONAR_1_SRL/MANUAL500/",
  "model": {
    "type": "srl_bert",
    "embedding_dropout": 0.1,
    "bert_model": "roberta-base"
  },
  "trainer": {
    "optimizer": {
      "type": "huggingface_adamw",
      "lr": 5e-5,
      "correct_bias": false,
      "weight_decay": 0.01,
      "parameter_groups": [
        [
          [
            "bias",
            "LayerNorm.bias",
            "LayerNorm.weight",
            "layer_norm.weight"
          ],
          {
            "weight_decay": 0.0
          }
        ]
      ]
    },
    "learning_rate_scheduler": {
      "type": "slanted_triangular"
    },
    "checkpointer": {
      "keep_most_recent_by_count": 2
    },
    "grad_norm": 1.0,
    "num_epochs": 15,
    "validation_metric": "+f1-measure-overall"
  }
}

However, this is still not working. I'm receiving the following error:

2022-02-22 16:19:34,122 - INFO - allennlp.training.gradient_descent_trainer - Training
  0%|          | 0/1546 [00:00<?, ?it/s]2022-02-22 16:19:34,142 - INFO - allennlp.data.samplers.bucket_batch_sampler - No sorting keys given; trying to guess a good one
2022-02-22 16:19:34,142 - INFO - allennlp.data.samplers.bucket_batch_sampler - Using ['tokens'] as the sorting keys
  0%|          | 0/1546 [00:00<?, ?it/s]
2022-02-22 16:19:34,526 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
  File "C:\Program Files\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\Scripts\allennlp.exe\__main__.py", line 7, in <module>
    sys.exit(run())
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\__main__.py", line 39, in run
    main(prog="allennlp")
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\__init__.py", line 119, in main
    args.func(args)
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 111, in train_model_from_args
    train_model_from_file(
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 177, in train_model_from_file
    return train_model(
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 258, in train_model
    model = _train_worker(
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 508, in _train_worker
    metrics = train_loop.run()
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 581, in run
    return self.trainer.train()
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\training\gradient_descent_trainer.py", line 771, in train
    metrics, epoch = self._try_train()
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\training\gradient_descent_trainer.py", line 793, in _try_train
    train_metrics = self._train_epoch(epoch)
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\training\gradient_descent_trainer.py", line 510, in _train_epoch
    batch_outputs = self.batch_outputs(batch, for_training=True)
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\training\gradient_descent_trainer.py", line 403, in batch_outputs
    output_dict = self._pytorch_model(**batch)
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp_models\structured_prediction\models\srl_bert.py", line 141, in forward
    bert_embeddings, _ = self.bert_model(
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\transformers\models\bert\modeling_bert.py", line 989, in forward
    embedding_output = self.embeddings(
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\transformers\models\bert\modeling_bert.py", line 215, in forward
    token_type_embeddings = self.token_type_embeddings(token_type_ids)
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\sparse.py", line 156, in forward
    return F.embedding(
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\functional.py", line 1916, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

I don't fully understand what's going wrong and couldn't find any documentation on how to change the config file to load a 'custom' BERT/RoBERTa model (one that's not mentioned here). I'm running the default allennlp train config.jsonnet command to start training; allennlp train config.jsonnet --dry-run produces no errors, however.
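
One clue: the traceback dies inside BERT's token_type_embeddings, and roberta-base appears to reserve only a single token-type (segment) id, whereas BERT reserves two. The illustrative check below (plain Huggingface transformers, nothing AllenNLP-specific) shows the difference:

from transformers import AutoConfig

# Illustrative check: how many token-type (segment) ids does each checkpoint
# support? BERT reserves two, RoBERTa typically only one, so any
# token_type_ids value of 1 indexes past the embedding table and raises
# "IndexError: index out of range in self".
for name in ["GroNLP/bert-base-dutch-cased", "roberta-base"]:
    config = AutoConfig.from_pretrained(name)
    print(name, "type_vocab_size =", config.type_vocab_size)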

Thanks in advance! Thijs

EDIT: I've now swapped out "srl_bert" for a custom "srl_roberta" class that inherits from it and uses RobertaModel. However, this still produces the same error.
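
For reference, the subclass boils down to something like the sketch below (the registration name and constructor are illustrative, not my exact code). Loading RobertaModel by itself doesn't help, presumably because the inherited forward() still feeds the verb indicator in as token_type_ids, which RoBERTa's single-entry token-type embedding cannot handle:

from typing import Any

from allennlp.data import Vocabulary
from allennlp.models import Model
from allennlp_models.structured_prediction.models.srl_bert import SrlBert
from transformers import RobertaModel


@Model.register("srl_roberta")
class SrlRoberta(SrlBert):
    """Same model as srl_bert, but with the encoder loaded as a RobertaModel."""

    def __init__(self, vocab: Vocabulary, bert_model: str, **kwargs: Any) -> None:
        # Pass an already-instantiated RobertaModel so SrlBert skips its own
        # BertModel.from_pretrained() call.
        super().__init__(vocab, RobertaModel.from_pretrained(bert_model), **kwargs)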

EDIT 2: I'm now using AutoTokenizer, as suggested by Dirk Groeneveld. It looks like changing the SrlReader class to support RoBERTa-based models involves far more changes, such as swapping BERT's WordPiece tokenizer for RoBERTa's BPE tokenizer. Is there an easy way to adapt the SrlReader class, or is it better to write a new RobertaSrlReader from scratch?

I've subclassed the SrlReader class and changed this line to the following:

self.bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)

It produces the following error, since RoBERTa's tokenization differs from BERT's:

  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp_models\structured_prediction\dataset_readers\srl.py", line 255, in text_to_instance
    wordpieces, offsets, start_offsets = self._wordpiece_tokenize_input(
  File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp_models\structured_prediction\dataset_readers\srl.py", line 196, in _wordpiece_tokenize_input
    word_pieces = self.bert_tokenizer.wordpiece_tokenizer.tokenize(token)
AttributeError: 'RobertaTokenizerFast' object has no attribute 'wordpiece_tokenizer'

Solution

  • The easiest way to resolve this is to patch SrlReader so that it uses PretrainedTransformerTokenizer (from AllenNLP) or AutoTokenizer (from Huggingface) instead of BertTokenizer. SrlReader is an old class, and was written against an old version of the Huggingface tokenizer API, so it's not so easy to upgrade.

    If you want to submit a pull request in the AllenNLP project, I'd be happy to help you get it merged into AllenNLP!
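
    As a starting point, here is a rough, untested sketch of the central change: redoing the wordpiece step of SrlReader against the current tokenizer API, so it works for both WordPiece (BERT) and BPE (RoBERTa) tokenizers. The function name is illustrative; inside a patched SrlReader it would take over the role of _wordpiece_tokenize_input, and the rest of the reader (plus the way srl_bert uses token_type_ids) would still need to be checked.

from typing import List, Tuple

from transformers import AutoTokenizer, PreTrainedTokenizerBase


def wordpiece_tokenize_input(
    tokenizer: PreTrainedTokenizerBase, tokens: List[str]
) -> Tuple[List[str], List[int], List[int]]:
    """Sub-tokenize the words of one sentence and keep per-word offsets.

    Mirrors what SrlReader._wordpiece_tokenize_input computes, but calls the
    generic tokenize() method instead of the BERT-only wordpiece_tokenizer.
    """
    word_pieces: List[str] = []
    start_offsets: List[int] = []
    end_offsets: List[int] = []
    cumulative = 0
    for i, token in enumerate(tokens):
        # RoBERTa's BPE marks word boundaries with a leading space, so add one
        # for every word after the first; WordPiece tokenizers just ignore it.
        text = token if i == 0 else " " + token
        pieces = tokenizer.tokenize(text)
        start_offsets.append(cumulative + 1)  # +1 accounts for the leading special token
        cumulative += len(pieces)
        end_offsets.append(cumulative)
        word_pieces.extend(pieces)
    # Use the tokenizer's own special tokens instead of hard-coded
    # "[CLS]" / "[SEP]" (RoBERTa uses "<s>" / "</s>").
    wordpieces = [tokenizer.cls_token] + word_pieces + [tokenizer.sep_token]
    return wordpieces, end_offsets, start_offsets


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("roberta-base")
    print(wordpiece_tokenize_input(tok, ["The", "cat", "sat", "."]))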