python · deep-learning · huggingface-transformers · bert-language-model · quantization

Use Quantization on HuggingFace Transformers models


I'm learning Quantization, and am experimenting with Section 1 of this notebook.

I want to use this code on my own models.

Hypothetically, I only need to assign my own model to the model variable in Section 1.2:


# load model
model = BertForSequenceClassification.from_pretrained(configs.output_dir)
model.to(configs.device)

My models are created with a different API: from transformers import pipeline. So .to() throws an AttributeError.

My Model:

# pip install transformers
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')
model = unmasker("Hello I'm a [MASK] model.")

Output:

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

How might I run the linked Quantization code on my example model?

Please let me know if there's anything else I should clarify in this post.


Solution

  • The pipeline approach won't work for quantisation, as we need the actual model objects to be returned. You can, however, use pipeline for testing the original models, for timing etc. (a sketch for recovering the model from a pipeline follows below).
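
    Even if you start from a pipeline, you can recover the underlying model and tokenizer, because the returned Pipeline object exposes them as attributes. A minimal sketch:

    from transformers import pipeline

    unmasker = pipeline('fill-mask', model='bert-base-uncased')

    # the Pipeline object wraps the actual model and tokenizer
    model = unmasker.model          # a BertForMaskedLM instance
    tokenizer = unmasker.tokenizer

    model.to('cpu')  # works now: this is a real nn.Module

    In the question, model = unmasker("Hello I'm a [MASK] model.") assigns the pipeline's prediction output (a list of dicts), not a model, which is why .to() raised an AttributeError.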


    Quantisation Code:

    token_logits contains the output logits of the model.

    You could place a for-loop around this code and replace model_name with strings from a list (see the note after the code).

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    sequence = "Distilled models are smaller than the models they mimic. Using them instead of the large " \
               f"versions would help {tokenizer.mask_token} our carbon footprint."

    inputs = tokenizer(sequence, return_tensors="pt")
    mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

    token_logits = model(**inputs).logits

    # <- can stop here
    
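    The code above stops just before any quantisation happens. As a hedged sketch of the missing step (this assumes dynamic quantisation via PyTorch's quantize_dynamic; the answer's source may use a different scheme), the loaded model's Linear layers can be converted to int8 like this:

    import torch

    # dynamic quantisation: Linear weights stored as int8,
    # activations quantised on the fly at inference time
    quantized_model = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    # rerun the masked-LM forward pass with the quantised model
    token_logits = quantized_model(**inputs).logits

    # decode the top-5 predictions for the [MASK] position
    top_ids = torch.topk(token_logits[0, mask_token_index, :], 5, dim=1).indices[0].tolist()
    print([tokenizer.decode([t]) for t in top_ids])

    For the for-loop idea mentioned above, wrap everything from model_name onwards in for model_name in ["bert-base-uncased", "distilbert-base-uncased"]: (the second checkpoint is just an example), and each model gets loaded and quantised in turn.
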

    Source