I'm trying to load a 7B, 4-bit, group-size-128 GPTQ LLaMA-based model from file (note that this specific model is just an example; I tested others and got similar problems), and the pipeline is completely eating up my 8 GB of VRAM:
My code:
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM, LlamaConfig, pipeline
torch.cuda.set_device(torch.device("cuda:0"))
PATH = './models/wizardLM-7B-GPTQ-4bit-128g'
config = LlamaConfig.from_json_file(f'{PATH}/config.json')
base_model = LlamaForCausalLM(config=config).half()
torch.cuda.empty_cache()
tokenizer = LlamaTokenizer.from_pretrained(
    pretrained_model_name_or_path=PATH,
    low_cpu_mem_usage=True,
    local_files_only=True
)
torch.cuda.empty_cache()
pipe = pipeline(
    "text-generation",
    model=base_model,
    tokenizer=tokenizer,
    batch_size=1,
    device=0,
    max_length=100,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.2
)
How can I make the pipeline initialization consume less VRAM?
GPU: AMD Radeon RX 6600 (8 GB VRAM, ROCm 5.4.2 & torch)
I want to mention that I managed to load the same model in other frontends such as "KoboldAI" and "text-generation-webui", so I know it should be possible.
The goal: to load the model "wizardLM-7B-GPTQ-4bit-128g" downloaded from Hugging Face and run it with LangChain in Python.
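For reference, this is a minimal sketch of how I check where the VRAM goes (assuming the torch.cuda memory queries report correctly on the ROCm build):

import torch

def report_vram(tag):
    # torch.cuda.mem_get_info() returns (free, total) in bytes for the current device
    free, total = torch.cuda.mem_get_info()
    print(f"{tag}: {(total - free) / 2**30:.2f} GiB used of {total / 2**30:.2f} GiB")

report_vram("before pipeline")
# ... model / tokenizer / pipeline construction goes here ...
report_vram("after pipeline")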
pip list output:
Package Version
------------------------ ----------------
accelerate 0.19.0
aiofiles 23.1.0
aiohttp 3.8.4
aiosignal 1.3.1
altair 5.0.0
anyio 3.6.2
argilla 1.7.0
async-timeout 4.0.2
attrs 23.1.0
backoff 2.2.1
beautifulsoup4 4.12.2
bitsandbytes 0.39.0
certifi 2022.12.7
cffi 1.15.1
chardet 5.1.0
charset-normalizer 2.1.1
chromadb 0.3.23
click 8.1.3
clickhouse-connect 0.5.24
cmake 3.25.0
colorclass 2.2.2
commonmark 0.9.1
compressed-rtf 1.0.6
contourpy 1.0.7
cryptography 40.0.2
cycler 0.11.0
dataclasses-json 0.5.7
datasets 2.12.0
Deprecated 1.2.13
dill 0.3.6
duckdb 0.8.0
easygui 0.98.3
ebcdic 1.1.1
et-xmlfile 1.1.0
extract-msg 0.41.1
fastapi 0.95.2
ffmpy 0.3.0
filelock 3.9.0
fonttools 4.39.4
frozenlist 1.3.3
fsspec 2023.5.0
gradio 3.28.3
gradio_client 0.2.5
greenlet 2.0.2
h11 0.14.0
hnswlib 0.7.0
httpcore 0.16.3
httptools 0.5.0
httpx 0.23.3
huggingface-hub 0.14.1
idna 3.4
IMAPClient 2.3.1
Jinja2 3.1.2
joblib 1.2.0
jsonschema 4.17.3
kiwisolver 1.4.4
langchain 0.0.171
lark-parser 0.12.0
linkify-it-py 2.0.2
lit 15.0.7
llama-cpp-python 0.1.50
loralib 0.1.1
lxml 4.9.2
lz4 4.3.2
Markdown 3.4.3
markdown-it-py 2.2.0
MarkupSafe 2.1.2
marshmallow 3.19.0
marshmallow-enum 1.5.1
matplotlib 3.7.1
mdit-py-plugins 0.3.3
mdurl 0.1.2
monotonic 1.6
mpmath 1.2.1
msg-parser 1.2.0
msoffcrypto-tool 5.0.1
multidict 6.0.4
multiprocess 0.70.14
mypy-extensions 1.0.0
networkx 3.0
nltk 3.8.1
numexpr 2.8.4
numpy 1.24.1
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
olefile 0.46
oletools 0.60.1
openai 0.27.7
openapi-schema-pydantic 1.2.4
openpyxl 3.1.2
orjson 3.8.12
packaging 23.1
pandas 1.5.3
pandoc 2.3
pcodedmp 1.2.6
pdfminer.six 20221105
Pillow 9.3.0
pip 23.0.1
plumbum 1.8.1
ply 3.11
posthog 3.0.1
psutil 5.9.5
pyarrow 12.0.0
pycparser 2.21
pydantic 1.10.7
pydub 0.25.1
Pygments 2.15.1
pygpt4all 1.1.0
pygptj 2.0.3
pyllamacpp 2.3.0
pypandoc 1.11
pyparsing 2.4.7
pyrsistent 0.19.3
python-dateutil 2.8.2
python-docx 0.8.11
python-dotenv 1.0.0
python-magic 0.4.27
python-multipart 0.0.6
python-pptx 0.6.21
pytorch-triton-rocm 2.0.1
pytz 2023.3
pytz-deprecation-shim 0.1.0.post0
PyYAML 6.0
red-black-tree-mod 1.20
regex 2023.5.5
requests 2.28.1
responses 0.18.0
rfc3986 1.5.0
rich 13.0.1
RTFDE 0.0.2
scikit-learn 1.2.2
scipy 1.10.1
semantic-version 2.10.0
sentence-transformers 2.2.2
sentencepiece 0.1.99
setuptools 66.0.0
six 1.16.0
sniffio 1.3.0
soupsieve 2.4.1
SQLAlchemy 2.0.15
starlette 0.27.0
sympy 1.11.1
tabulate 0.9.0
tenacity 8.2.2
threadpoolctl 3.1.0
tokenizers 0.13.3
toolz 0.12.0
torch 2.0.1+rocm5.4.2
torchaudio 2.0.2+rocm5.4.2
torchvision 0.15.2+rocm5.4.2
tqdm 4.65.0
transformers 4.30.0.dev0
triton 2.0.0
typer 0.9.0
typing_extensions 4.4.0
typing-inspect 0.8.0
tzdata 2023.3
tzlocal 4.2
uc-micro-py 1.0.2
unstructured 0.6.6
urllib3 1.26.13
uvicorn 0.22.0
uvloop 0.17.0
watchfiles 0.19.0
websockets 11.0.3
wheel 0.38.4
wikipedia 1.4.0
wrapt 1.14.1
XlsxWriter 3.1.0
xxhash 3.2.0
yarl 1.9.2
zstandard 0.21.0
I assume you are trying to load this model: TheBloke/wizardLM-7B-GPTQ. This model cannot be loaded directly with the transformers library, because it was quantized to 4-bit with GPTQ. Note also that LlamaForCausalLM(config=config) only builds a randomly initialized model from the config, and even after .half() a 7B model needs roughly 14 GB in fp16, which is why moving it to the GPU exhausts your 8 GB of VRAM. You can load the quantized weights with AutoGPTQ instead:
pip install auto-gptq
import torch
from transformers import LlamaTokenizer, pipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
quantize_config = BaseQuantizeConfig(bits=4, damp_percent=0.01, desc_act=True, group_size=128)
model_id = 'TheBloke/wizardLM-7B-GPTQ'
# I downloaded the model from the hub due to name conflicts
m = AutoGPTQForCausalLM.from_quantized("/tmp/blabla/", device="cuda:0", quantize_config=quantize_config, use_safetensors=True)
t = LlamaTokenizer.from_pretrained(
    pretrained_model_name_or_path=model_id,
)
pipe = pipeline(
    "text-generation",
    model=m,
    tokenizer=t,
    batch_size=1,
    device=0,
    max_length=100,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.2
)
pipe("Please give me an life changing advise.")
Output:
[{'generated_text': 'Please give me an life changing advise.\nI am a 28 year old woman and I have been struggling with anxiety for the past few years now. It has affected my personal and professional life greatly. I have tried various therapies, medications etc but nothing seems to work long term. Recently, I started practicing meditation regularly and it has helped me immensely in reducing my anxiety levels. However, I still struggle with social situations and public speaking. Can'}]
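Since your goal is to drive the model from LangChain, you can wrap the resulting pipeline in HuggingFacePipeline as usual. A minimal sketch, assuming the AutoGPTQ-backed pipeline behaves like any other text-generation pipeline as far as LangChain's wrapper is concerned:

from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

# Wrap the transformers pipeline built above so LangChain can call it
llm = HuggingFacePipeline(pipeline=pipe)

template = """Question: {question}

Answer:"""
prompt = PromptTemplate(template=template, input_variables=["question"])
chain = LLMChain(prompt=prompt, llm=llm)

print(chain.run("What is a good way to reduce anxiety?"))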