Are these normal inference speeds for pretrained BERT models in PyTorch?

I am testing the BERT base and DistilBERT models from Hugging Face in 4 scenarios, all with batch_size = 1:

1) bert-base-uncased: 154ms per request
2) bert-base-uncased with quantization: 94ms per request
3) distilbert-base-uncased: 86ms per request
4) distilbert-base-uncased with quantization: 69ms per request

I am using IMDB review text as test data with max_length=512, so the inputs are quite long. The machine runs Ubuntu 18.04, and its CPU info is below:

cat /proc/cpuinfo  | grep 'name'| uniq
model name  : Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz

The machine has 3 GPUs available for use:

Tesla V100-SXM2

This seems quite slow for a real-time application. Are these speeds normal for a BERT base model?

The testing code is below:

import pandas as pd
import torch.quantization

from transformers import AutoTokenizer, AutoModel, DistilBertTokenizer, DistilBertModel

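def get_embedding(model, tokenizer, text):
    # Tokenize a single text, truncating to at most 512 tokens
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    outputs = model(**inputs)
    # outputs[0] is the last hidden state with shape (1, seq_len, hidden);
    # [0] selects the only sequence in the batch
    output_tensors = outputs[0][0]
    output_numpy = output_tensors.detach().numpy()
    # Keep the vector of the first token ([CLS]) as the embedding
    embedding = output_numpy.tolist()[0]
    return embedding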

def process_text(model, tokenizer, text_lines):
    for index, line in enumerate(text_lines):
        embedding = get_embedding(model, tokenizer, line)
        if index % 100 == 0:
            print('Current index: {}'.format(index))

import time
from datetime import timedelta
if __name__ == "__main__":

    df = pd.read_csv('../data/train.csv', sep='\t')
    df = df.head(1000)
    text_lines = df['review']
    text_line_count = len(text_lines)
    print('Text size: {}'.format(text_line_count))

    start = time.time()

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    process_text(model, tokenizer, text_lines)

    end = time.time()
    print('Total time spent with bert base: {}'.format(str(timedelta(seconds=end - start))))

    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    process_text(model, tokenizer, text_lines)

    end2 = time.time()
    print('Total time spent with bert base quantization: {}'.format(str(timedelta(seconds=end2 - end))))

    tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    model = DistilBertModel.from_pretrained("distilbert-base-uncased")
    process_text(model, tokenizer, text_lines)

    end3 = time.time()
    print('Total time spent with distilbert: {}'.format(str(timedelta(seconds=end3 - end2))))

    model = DistilBertModel.from_pretrained("distilbert-base-uncased")
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    process_text(model, tokenizer, text_lines)

    end4 = time.time()
    print('Total time spent with distilbert quantization: {}'.format(str(timedelta(seconds=end4 - end3))))
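Note that each timed interval in the script above also covers loading the tokenizer and model (and, for the quantized runs, the quantize_dynamic call), so the per-request figures include some one-off cost. A minimal sketch of a helper that times only the inference calls is shown here; the name time_per_request and the warmup parameter are hypothetical and not part of the original script.

```python
import time


def time_per_request(fn, items, warmup=3):
    """Return the mean wall-clock seconds per call of fn over items.

    The first `warmup` items are run once untimed so that one-off costs
    (lazy initialisation, cache warm-up) stay out of the measurement.
    """
    items = list(items)
    if not items:
        raise ValueError("items must not be empty")

    # Untimed warm-up calls
    for item in items[:warmup]:
        fn(item)

    # Timed calls over the full input
    start = time.perf_counter()
    for item in items:
        fn(item)
    elapsed = time.perf_counter() - start
    return elapsed / len(items)


# Hypothetical usage with the functions from the question; the model and
# tokenizer are loaded beforehand so that loading is not timed:
# seconds = time_per_request(lambda t: get_embedding(model, tokenizer, t), text_lines)
```

Loading each model once before calling the helper keeps the four scenarios comparable, since only the forward passes are measured.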

EDIT: Based on a suggestion, I changed the inference call to the following:

inputs = tokenizer(text_batch, padding=True, return_tensors="pt")
outputs = model(**inputs)

where text_batch is a list of texts used as input.
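The batched call can be wrapped in a loop over fixed-size chunks, roughly as sketched below. This is an illustration under assumptions, not the code that was actually run: chunked and embed_in_batches are hypothetical names, the batch size of 32 is arbitrary, and truncation=True with max_length=512 is carried over from the original get_embedding. torch.no_grad() is added because gradients are not needed for inference.

```python
def chunked(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    items = list(items)
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(model, tokenizer, texts, batch_size=32):
    """Return one [CLS] vector (a list of floats) per input text."""
    import torch  # imported here so chunked() is usable without torch installed

    model.eval()  # make sure dropout is disabled for inference
    embeddings = []
    with torch.no_grad():  # skip autograd bookkeeping during inference
        for text_batch in chunked(texts, batch_size):
            # Pad to the longest text in this batch, truncate at 512 tokens
            inputs = tokenizer(text_batch, padding=True, truncation=True,
                               max_length=512, return_tensors="pt")
            outputs = model(**inputs)
            # outputs[0] has shape (batch, seq_len, hidden);
            # keep the first-token vector of every row
            embeddings.extend(outputs[0][:, 0, :].tolist())
    return embeddings
```

Because padding=True pads only to the longest text in each batch, batches of similar-length texts waste less computation on padding tokens.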


  • No, you can speed it up.

    First, why are you testing it with batch size 1?

    Both the tokenizer and the model accept batched inputs. Basically, you can pass a list that contains one sample per row. See the tokenizer documentation; the same applies to the models.

    Also, your for loop is sequential even if you use a batch size larger than 1. You can create a test dataset and then use the Trainer class with trainer.predict().

    Also see this discussion of mine at the HF forums: