I'm trying to load a 7B, 4-bit, group-size-128 GPTQ LLaMA-based model from file (note that this specific model is just an example; I tested others and got similar problems), and the pipeline is completely eating up my 8 GB of VRAM:
My code:
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM, LlamaConfig, pipeline
torch.cuda.set_device(torch.device("cuda:0"))
PATH = './models/wizardLM-7B-GPTQ-4bit-128g'
config = LlamaConfig.from_json_file(f'{PATH}/config.json')
base_model = LlamaForCausalLM(config=config).half()
torch.cuda.empty_cache()
tokenizer = LlamaTokenizer.from_pretrained(
    pretrained_model_name_or_path=PATH,
    low_cpu_mem_usage=True,
    local_files_only=True
)
torch.cuda.empty_cache()
pipe = pipeline(
    "text-generation",
    model=base_model,
    tokenizer=tokenizer,
    batch_size=1,
    device=0,
    max_length=100,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.2
)
How can I make the pipeline initialization consume less VRAM?
GPU: AMD Radeon RX 6600 (8 GB VRAM, ROCm 5.4.2 & torch)
I want to mention that I managed to load the same model in other frontends such as "KoboldAI" and "text-generation-webui", so I know it should be possible.
The goal: to load the model "wizardLM-7B-GPTQ-4bit-128g" downloaded from Hugging Face and run it with LangChain in Python.
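For reference, this is a minimal sketch of how I check where the VRAM goes (assuming the torch.cuda memory queries report correctly on the ROCm build):

import torch

def report_vram(tag):
    # torch.cuda.mem_get_info() returns (free, total) in bytes for the current device
    free, total = torch.cuda.mem_get_info()
    print(f"{tag}: {(total - free) / 2**30:.2f} GiB used of {total / 2**30:.2f} GiB")

report_vram("before pipeline")
# ... model / tokenizer / pipeline construction goes here ...
report_vram("after pipeline")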
pip list output:
Package Version
------------------------ ----------------
accelerate 0.19.0
aiofiles 23.1.0
aiohttp 3.8.4
aiosignal 1.3.1
altair 5.0.0
anyio 3.6.2
argilla 1.7.0
async-timeout 4.0.2
attrs 23.1.0
backoff 2.2.1
beautifulsoup4 4.12.2
bitsandbytes 0.39.0
certifi 2022.12.7
cffi 1.15.1
chardet 5.1.0
charset-normalizer 2.1.1
chromadb 0.3.23
click 8.1.3
clickhouse-connect 0.5.24
cmake 3.25.0
colorclass 2.2.2
commonmark 0.9.1
compressed-rtf 1.0.6
contourpy 1.0.7
cryptography 40.0.2
cycler 0.11.0
dataclasses-json 0.5.7
datasets 2.12.0
Deprecated 1.2.13
dill 0.3.6
duckdb 0.8.0
easygui 0.98.3
ebcdic 1.1.1
et-xmlfile 1.1.0
extract-msg 0.41.1
fastapi 0.95.2
ffmpy 0.3.0
filelock 3.9.0
fonttools 4.39.4
frozenlist 1.3.3
fsspec 2023.5.0
gradio 3.28.3
gradio_client 0.2.5
greenlet 2.0.2
h11 0.14.0
hnswlib 0.7.0
httpcore 0.16.3
httptools 0.5.0
httpx 0.23.3
huggingface-hub 0.14.1
idna 3.4
IMAPClient 2.3.1
Jinja2 3.1.2
joblib 1.2.0
jsonschema 4.17.3
kiwisolver 1.4.4
langchain 0.0.171
lark-parser 0.12.0
linkify-it-py 2.0.2
lit 15.0.7
llama-cpp-python 0.1.50
loralib 0.1.1
lxml 4.9.2
lz4 4.3.2
Markdown 3.4.3
markdown-it-py 2.2.0
MarkupSafe 2.1.2
marshmallow 3.19.0
marshmallow-enum 1.5.1
matplotlib 3.7.1
mdit-py-plugins 0.3.3
mdurl 0.1.2
monotonic 1.6
mpmath 1.2.1
msg-parser 1.2.0
msoffcrypto-tool 5.0.1
multidict 6.0.4
multiprocess 0.70.14
mypy-extensions 1.0.0
networkx 3.0
nltk 3.8.1
numexpr 2.8.4
numpy 1.24.1
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
olefile 0.46
oletools 0.60.1
openai 0.27.7
openapi-schema-pydantic 1.2.4
openpyxl 3.1.2
orjson 3.8.12
packaging 23.1
pandas 1.5.3
pandoc 2.3
pcodedmp 1.2.6
pdfminer.six 20221105
Pillow 9.3.0
pip 23.0.1
plumbum 1.8.1
ply 3.11
posthog 3.0.1
psutil 5.9.5
pyarrow 12.0.0
pycparser 2.21
pydantic 1.10.7
pydub 0.25.1
Pygments 2.15.1
pygpt4all 1.1.0
pygptj 2.0.3
pyllamacpp 2.3.0
pypandoc 1.11
pyparsing 2.4.7
pyrsistent 0.19.3
python-dateutil 2.8.2
python-docx 0.8.11
python-dotenv 1.0.0
python-magic 0.4.27
python-multipart 0.0.6
python-pptx 0.6.21
pytorch-triton-rocm 2.0.1
pytz 2023.3
pytz-deprecation-shim 0.1.0.post0
PyYAML 6.0
red-black-tree-mod 1.20
regex 2023.5.5
requests 2.28.1
responses 0.18.0
rfc3986 1.5.0
rich 13.0.1
RTFDE 0.0.2
scikit-learn 1.2.2
scipy 1.10.1
semantic-version 2.10.0
sentence-transformers 2.2.2
sentencepiece 0.1.99
setuptools 66.0.0
six 1.16.0
sniffio 1.3.0
soupsieve 2.4.1
SQLAlchemy 2.0.15
starlette 0.27.0
sympy 1.11.1
tabulate 0.9.0
tenacity 8.2.2
threadpoolctl 3.1.0
tokenizers 0.13.3
toolz 0.12.0
torch 2.0.1+rocm5.4.2
torchaudio 2.0.2+rocm5.4.2
torchvision 0.15.2+rocm5.4.2
tqdm 4.65.0
transformers 4.30.0.dev0
triton 2.0.0
typer 0.9.0
typing_extensions 4.4.0
typing-inspect 0.8.0
tzdata 2023.3
tzlocal 4.2
uc-micro-py 1.0.2
unstructured 0.6.6
urllib3 1.26.13
uvicorn 0.22.0
uvloop 0.17.0
watchfiles 0.19.0
websockets 11.0.3
wheel 0.38.4
wikipedia 1.4.0
wrapt 1.14.1
XlsxWriter 3.1.0
xxhash 3.2.0
yarl 1.9.2
zstandard 0.21.0
I assume you are trying to load this model: TheBloke/wizardLM-7B-GPTQ. This model cannot be loaded directly with the transformers library, because it was quantized to 4-bit with GPTQ. Note also that LlamaForCausalLM(config=config) only builds a randomly initialized model from the config, and even after .half() a 7B model needs roughly 14 GB in fp16, which is why moving it to the GPU exhausts your 8 GB of VRAM. You can load the quantized weights with AutoGPTQ instead:
pip install auto-gptq
import torch
from transformers import LlamaTokenizer, pipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
quantize_config = BaseQuantizeConfig(bits=4, damp_percent=0.01, desc_act=True, group_size=128)
model_id = 'TheBloke/wizardLM-7B-GPTQ'
# I downloaded the model from the hub due to name conflicts
m = AutoGPTQForCausalLM.from_quantized("/tmp/blabla/", device="cuda:0", quantize_config=quantize_config, use_safetensors=True)
t = LlamaTokenizer.from_pretrained(
    pretrained_model_name_or_path=model_id,
)
pipe = pipeline(
    "text-generation",
    model=m,
    tokenizer=t,
    batch_size=1,
    device=0,
    max_length=100,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.2
)
pipe("Please give me an life changing advise.")
Output:
[{'generated_text': 'Please give me an life changing advise.\nI am a 28 year old woman and I have been struggling with anxiety for the past few years now. It has affected my personal and professional life greatly. I have tried various therapies, medications etc but nothing seems to work long term. Recently, I started practicing meditation regularly and it has helped me immensely in reducing my anxiety levels. However, I still struggle with social situations and public speaking. Can'}]
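Since your goal is to drive the model from LangChain, you can wrap the resulting pipeline in HuggingFacePipeline as usual. A minimal sketch, assuming the AutoGPTQ-backed pipeline behaves like any other text-generation pipeline as far as LangChain's wrapper is concerned:

from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

# Wrap the transformers pipeline built above so LangChain can call it
llm = HuggingFacePipeline(pipeline=pipe)

template = """Question: {question}

Answer:"""
prompt = PromptTemplate(template=template, input_variables=["question"])
chain = LLMChain(prompt=prompt, llm=llm)

print(chain.run("What is a good way to reduce anxiety?"))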