I am testing the BERT base and distilled BERT models from Hugging Face in four speed scenarios, all with batch_size = 1:
1) bert-base-uncased: 154ms per request
2) bert-base-uncased with quantization: 94ms per request
3) distilbert-base-uncased: 86ms per request
4) distilbert-base-uncased with quantization: 69ms per request
I am using IMDB reviews as experimental data and set max_length=512, so the inputs are quite long. The CPU info on Ubuntu 18.04 is below:
cat /proc/cpuinfo | grep 'name'| uniq
model name : Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
The machine also has 3 GPUs available for use:
Tesla V100-SXM2
This seems quite slow for a real-time application. Are these speeds normal for the BERT base model?
The testing code is below:
import time
from datetime import timedelta

import pandas as pd
import torch
import torch.quantization
from transformers import AutoTokenizer, AutoModel, DistilBertTokenizer, DistilBertModel


def get_embedding(model, tokenizer, text):
    # Tokenize a single review and run one forward pass.
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model(**inputs)
    # Last hidden state of the single sample in the batch: shape (seq_len, hidden_size).
    output_tensors = outputs[0][0]
    output_numpy = output_tensors.detach().numpy()
    # Keep the embedding of the first token ([CLS]).
    embedding = output_numpy.tolist()[0]
    return embedding


def process_text(model, tokenizer, text_lines):
    for index, line in enumerate(text_lines):
        embedding = get_embedding(model, tokenizer, line)
        if index % 100 == 0:
            print('Current index: {}'.format(index))


if __name__ == "__main__":
    df = pd.read_csv('../data/train.csv', sep='\t')
    df = df.head(1000)
    text_lines = df['review']
    text_line_count = len(text_lines)
    print('Text size: {}'.format(text_line_count))

    start = time.time()

    # 1) bert-base-uncased
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    process_text(model, tokenizer, text_lines)
    end = time.time()
    print('Total time spent with bert base: {}'.format(str(timedelta(seconds=end - start))))

    # 2) bert-base-uncased with dynamic quantization
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    process_text(model, tokenizer, text_lines)
    end2 = time.time()
    print('Total time spent with bert base quantization: {}'.format(str(timedelta(seconds=end2 - end))))

    # 3) distilbert-base-uncased
    tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    model = DistilBertModel.from_pretrained("distilbert-base-uncased")
    process_text(model, tokenizer, text_lines)
    end3 = time.time()
    print('Total time spent with distilbert: {}'.format(str(timedelta(seconds=end3 - end2))))

    # 4) distilbert-base-uncased with dynamic quantization
    model = DistilBertModel.from_pretrained("distilbert-base-uncased")
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    process_text(model, tokenizer, text_lines)
    end4 = time.time()
    print('Total time spent with distilbert quantization: {}'.format(str(timedelta(seconds=end4 - end3))))
EDIT: based on the suggestion, I changed the code to the following:
inputs = tokenizer(text_batch, padding=True, return_tensors="pt")
outputs = model(**inputs)
where text_batch is a list of texts passed to the tokenizer in one call. A rough sketch of the full batched loop is below.
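For reference, a minimal sketch of how the batched loop could look (the batch_size of 16, the torch.no_grad() wrapper, and the process_text_batched name are illustrative choices, not part of the original code):

import torch

def process_text_batched(model, tokenizer, text_lines, batch_size=16):
    # Illustrative batched replacement for process_text; batch_size is arbitrary.
    embeddings = []
    lines = list(text_lines)
    with torch.no_grad():
        for start in range(0, len(lines), batch_size):
            text_batch = lines[start:start + batch_size]
            inputs = tokenizer(text_batch, padding=True, truncation=True,
                               max_length=512, return_tensors="pt")
            outputs = model(**inputs)
            # outputs[0] has shape (batch_size, seq_len, hidden_size);
            # keep one [CLS] embedding per review.
            embeddings.extend(outputs[0][:, 0, :].tolist())
    return embeddings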
No, you can speed it up.
First, why are you testing it with batch size 1? Both the tokenizer and the model accept batched inputs: you can pass a 2D array/list with a single sample in each row (or, for the tokenizer, simply a list of strings). See the documentation for the tokenizer's __call__: https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.__call__ The same applies to the models.
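For instance, a quick sketch of what a batched tokenizer call returns (the two sentences are just placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = ["a short review", "a somewhat longer review about the movie"]
inputs = tokenizer(batch, padding=True, truncation=True, max_length=512, return_tensors="pt")

# input_ids and attention_mask are 2D tensors of shape (batch_size, longest_seq_len),
# so the whole batch can go through the model in a single call: model(**inputs)
print(inputs["input_ids"].shape)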
Also, your for loop is sequential even if you use a batch size larger than 1. You can create a test dataset and then use the Trainer class with trainer.predict(), as in the sketch below.
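As an illustration, here is a rough sketch of that approach with a plain PyTorch Dataset wrapping the tokenized reviews (the ReviewDataset class, the batch size of 32, and the output_dir name are my own illustrative choices):

import torch
from transformers import AutoTokenizer, AutoModel, Trainer, TrainingArguments

class ReviewDataset(torch.utils.data.Dataset):
    # Wraps pre-tokenized encodings so Trainer can iterate over them in batches.
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings["input_ids"])
    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.encodings.items()}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

texts = ["a review ...", "another review ..."]  # e.g. df['review'].tolist()
encodings = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
dataset = ReviewDataset(encodings)

args = TrainingArguments(output_dir="tmp_trainer", per_device_eval_batch_size=32)
trainer = Trainer(model=model, args=args)

# result.predictions holds the outputs gathered over all batches; for a bare AutoModel
# this is the last hidden state, which can get large in memory for long inputs.
result = trainer.predict(dataset)

Trainer also moves the model to a GPU automatically when one is available, which should help given the V100s on your machine.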
Also see this discussion of mine at the HF forums: https://discuss.huggingface.co/t/urgent-trainer-predict-and-model-generate-creates-totally-different-predictions/3426