Goal: Use this notebook to perform quantisation on the albert-base-v2 model.
Kernel: conda_pytorch_p36
Outputs in Sections 1.2 & 2.2 show that, for BERT:

PyTorch full-precision model: 417.6 MB
PyTorch quantized model: 173.0 MB
ONNX quantized model: 104.8 MB

However, when running ALBert I get the output below. I think this is the reason both quantization methods of ALBert perform worse than vanilla ALBert.
PyTorch:
Size (MB): 44.58906650543213
Size (MB): 22.373255729675293
ONNX:
ONNX full precision model size (MB): 341.64233207702637
ONNX quantized model size (MB): 85.53886985778809
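The "Size (MB):" lines above can be reproduced by serializing the state dict to disk and measuring the file. A minimal sketch with a toy model (the helper name and model are illustrative, not from the notebook):

```python
import os

import torch
import torch.nn as nn

def model_size_mb(model):
    # Serialize the state dict and report the file size in MB,
    # matching the "Size (MB):" lines above.
    torch.save(model.state_dict(), "temp.p")
    size = os.path.getsize("temp.p") / 1e6
    os.remove("temp.p")
    return size

float_model = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))

# Dynamic quantization converts Linear weights to int8, which is why
# the quantized files above are roughly a quarter to half the size.
quant_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8)

print("Size (MB):", model_size_mb(float_model))
print("Size (MB):", model_size_mb(quant_model))
```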
Why might exporting ALBert from PyTorch to ONNX increase model size, but not for BERT?
Please let me know if there's anything else I can add to the post.
The ALBert model shares weights across its layers. torch.onnx.export writes each use of a shared weight out as a separate tensor, which makes the exported model grow much larger.
A number of GitHub issues have been marked solved regarding this phenomenon. The most common solution is to remove the shared weights, that is, to deduplicate initializer tensors that contain exactly the same values.
See the section "Removing shared weights" in onnx_remove_shared_weights.ipynb.
import onnx
from onnxruntime.transformers.onnx_model import OnnxModel

model = onnx.load(path)  # path: the exported ONNX model file
onnx_model = OnnxModel(model)

count = len(model.graph.initializer)
same = [-1] * count
# Mark every initializer that duplicates an earlier one.
for i in range(count - 1):
    if same[i] >= 0:
        continue
    for j in range(i + 1, count):
        if OnnxModel.has_same_value(model.graph.initializer[i],
                                    model.graph.initializer[j]):
            same[j] = i

# Re-point all nodes that use a duplicate to the first copy.
for i in range(count):
    if same[i] >= 0:
        onnx_model.replace_input_of_all_nodes(model.graph.initializer[i].name,
                                              model.graph.initializer[same[i]].name)

# Drop the now-unreferenced duplicates and save.
onnx_model.update_graph()
onnx_model.save_model_to_file(output_path)