Currently the Helsinki-NLP/opus-mt-es-en model takes around 1.5 seconds for inference with Transformers. How can that be reduced? Also, when trying to convert it to ONNX Runtime, I get this error:
ValueError: Unrecognized configuration class <class 'transformers.models.marian.configuration_marian.MarianConfig'> for this kind of AutoModel: AutoModel. Model type should be one of RetriBertConfig, MT5Config, T5Config, DistilBertConfig, AlbertConfig, CamembertConfig, XLMRobertaConfig, BartConfig, LongformerConfig, RobertaConfig, LayoutLMConfig, SqueezeBertConfig, BertConfig, OpenAIGPTConfig, GPT2Config, MobileBertConfig, TransfoXLConfig, XLNetConfig, FlaubertConfig, FSMTConfig, XLMConfig, CTRLConfig, ElectraConfig, ReformerConfig, FunnelConfig, LxmertConfig, BertGenerationConfig, DebertaConfig, DPRConfig, XLMProphetNetConfig, ProphetNetConfig, MPNetConfig, TapasConfig.
Is it possible to convert this model to ONNX Runtime?
The OPUS models were originally trained with Marian, a highly optimized machine translation toolkit written entirely in C++. Unlike PyTorch, it does not have the ambition to be a general deep learning toolkit, so it can focus on MT efficiency. The Marian configurations and instructions on how to download the models are at
The OPUS-MT models for Hugging Face's Transformers are converted from the original Marian models and are meant more for prototyping and analyzing the models than for translation in a production-like setup.
Running the models in Marian will certainly be much faster than in Python, and it is certainly much easier than hacking Transformers to run with ONNX Runtime. Marian also offers further tricks to speed up translation, e.g., model quantization, which however comes at the expense of translation quality.
With both Marian and Transformers, you can speed things up if you use a GPU or if you narrow the beam width during decoding (the num_beams argument of the generate method in Transformers).
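As a rough illustration of the last point, here is a minimal sketch of narrowing the beam and using a GPU when one is available in Transformers. The helper names (timed, load_translator, translate) are made up for this example and are not part of any library; it assumes torch, transformers and sentencepiece are installed, and that the model is loaded once and reused across calls rather than reloaded per request.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def load_translator(model_name="Helsinki-NLP/opus-mt-es-en"):
    """Load tokenizer and model once; move the model to GPU if available."""
    # Imported here so the timing helper above works without these packages.
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device).eval()
    return tokenizer, model, device


def translate(texts, tokenizer, model, device, num_beams=1):
    """Translate a list of sentences; num_beams=1 means greedy decoding."""
    import torch

    batch = tokenizer(texts, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():  # inference only, no gradients needed
        output_ids = model.generate(**batch, num_beams=num_beams)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

To compare settings, call tokenizer, model, device = load_translator() once, then run timed(translate, ["Hola, ¿cómo estás?"], tokenizer, model, device, num_beams=1) and again with a larger num_beams. How much time is saved depends on your hardware and sentence length, and a narrower beam may cost some translation quality.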