
Huggingface Bert, Which Bert flavor is the fastest to train for debugging?


I am working with BERT and the Hugging Face model hub (https://huggingface.co/models). I was wondering which of the available models you would choose for debugging?

In other words, which model trains/loads the fastest on my GPU, so I can get runs through as quickly as possible? ALBERT, DistilBERT, or another?


Solution

  • I think the choice of model for debugging can be critical, and it depends entirely on the kind of debugging you want to perform.

    Specifically, consider tokenization: each model carries its own derivation of the base tokenizer class, so any model-specific behavior will only surface if you also use that model's tokenizer. Say, for example, you want to debug a (later) RoBERTa implementation by using DistilBERT during debugging. Anything specific to RoBERTa's tokenization will not be reproduced by DistilBERT, which uses BERT's tokenizer. Similarly, specifics of the training process can differ enough to completely derail training. From anecdotal evidence, I have had models train to completion (and convergence) with RoBERTa but not with BERT, which makes swapping in a different model for "debugging" a potentially dangerous substitution. ALBERT again has properties different from all of the above-mentioned models, but the same caveats apply.
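    To make the tokenization point concrete, here is a minimal sketch (the sample sentence is arbitrary, and it assumes the `transformers` library is installed) that tokenizes the same text with the RoBERTa and DistilBERT tokenizers; the resulting token sequences differ, which is exactly the kind of discrepancy that makes one model a poor debugging stand-in for another:

    ```python
    from transformers import AutoTokenizer

    # Arbitrary sample text for illustration.
    text = "Debugging tokenization differences!"

    # RoBERTa uses a byte-level BPE tokenizer; DistilBERT reuses
    # BERT's (lowercasing) WordPiece tokenizer.
    roberta_tok = AutoTokenizer.from_pretrained("roberta-base")
    distil_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    roberta_tokens = roberta_tok.tokenize(text)
    distil_tokens = distil_tok.tokenize(text)

    print(roberta_tokens)
    print(distil_tokens)
    ```

    Any code path that depends on token boundaries, casing, or special tokens will therefore behave differently between the two models.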

    If you just want to prototype services and need a stand-in model, I think either of the models you suggest would do just fine, and there should be only a minor difference in loading/saving time depending on the exact number of model parameters. Keep in mind, though, that inference time also matters for applications. Unless you are absolutely sure there will be no noticeable difference in execution time, at least make sure you also test with the full model.
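    One further option for pure pipeline debugging (as opposed to debugging model behavior): instantiate a deliberately tiny, randomly initialized model from a local config. This is a sketch with arbitrarily chosen sizes, assuming `transformers` is installed; nothing is downloaded, and the model builds in well under a second:

    ```python
    from transformers import BertConfig, BertModel

    # Deliberately tiny BERT: random weights, built locally from a
    # config. The sizes below are arbitrary illustration values.
    tiny_config = BertConfig(
        hidden_size=64,
        num_hidden_layers=2,
        num_attention_heads=2,
        intermediate_size=128,
    )
    tiny_model = BertModel(tiny_config)

    # Orders of magnitude smaller than bert-base (~110M parameters).
    n_params = sum(p.numel() for p in tiny_model.parameters())
    print(f"{n_params:,} parameters")
    ```

    A model like this is useless for any quality check, but it lets you exercise data loading, the training loop, and serving code at maximum speed before switching to the real model.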