Including allennlp predictor arguments in config.json

I'm training an allennlp crf_tagger and using a predictor based on the SentenceTaggerPredictor. The issue is the tokenizer: SentenceTaggerPredictor takes a language argument that determines which tokenizer it creates.

Since SentenceTaggerPredictor has language="en_core_web_sm" as a default argument, when I do

Predictor.from_path("model.tar.gz", "sentence_tagger")

the tokenizer is created using the default language. But what happens if the training data was tokenized using a different language? How do I specify the predictor's arguments in the model's config.json so that Predictor.from_path constructs the predictor with a non-default language?

Solution

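The Predictor.from_path() method has an overrides parameter that you could use in this case. It replaces values from the archived config.json when the archive is loaded, and nested settings are addressed with dotted keys. For example, Predictor.from_path("model.tar.gz", "sentence_tagger", overrides={"dataset_reader.tokenizer.language": "en"}) sets the language of the dataset reader's tokenizer to "en".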
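
To make the dotted key easier to read, here is a minimal pure-Python sketch of how a key such as "dataset_reader.tokenizer.language" maps onto a nested config. It is an illustration only, not AllenNLP's own implementation: the helper apply_dotted_overrides, the reader type "my_tagging_reader", and the sample config are all made up for this example.

```python
import copy
from typing import Any, Dict


def apply_dotted_overrides(
    config: Dict[str, Any], overrides: Dict[str, Any]
) -> Dict[str, Any]:
    """Return a copy of `config` with dotted-key overrides applied.

    Hypothetical helper for illustration; not part of AllenNLP.
    """
    result = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        # "a.b.c" -> parents ["a", "b"], leaf "c"
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            child = node.get(part)
            if not isinstance(child, dict):
                # Create intermediate dictionaries when the path is missing.
                child = {}
                node[part] = child
            node = child
        node[leaf] = value
    return result


# Made-up stand-in for the config.json stored inside model.tar.gz.
archived_config = {
    "dataset_reader": {
        "type": "my_tagging_reader",  # hypothetical reader name
        "tokenizer": {"type": "spacy", "language": "en_core_web_sm"},
    },
    "model": {"type": "crf_tagger"},
}

merged = apply_dotted_overrides(
    archived_config,
    {"dataset_reader.tokenizer.language": "en"},
)
print(merged["dataset_reader"]["tokenizer"])
```

Only the addressed leaf changes; the rest of the config, and the original dictionary, stay as they were. Whether your archive actually has a dataset_reader.tokenizer.language entry depends on your own config.json, so check the key path there before relying on the override.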