I want to create a self-hosted LLM that can use my own custom data (Slack conversations, in this case) as context.
I've heard Vicuna is a great alternative to ChatGPT, so I wrote the code below:
from llama_index import SimpleDirectoryReader, LangchainEmbedding, GPTListIndex, \
    GPTSimpleVectorIndex, PromptHelper, LLMPredictor, Document, ServiceContext
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import torch
from langchain.llms.base import LLM
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

!export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

class CustomLLM(LLM):
    model_name = "eachadea/vicuna-13b-1.1"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    pipeline = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device=0,
                        model_kwargs={"torch_dtype": torch.bfloat16})

    def _call(self, prompt, stop=None):
        return self.pipeline(prompt, max_length=9999)[0]["generated_text"]

    def _identifying_params(self):
        return {"name_of_model": self.model_name}

    def _llm_type(self):
        return "custom"

llm_predictor = LLMPredictor(llm=CustomLLM())
But sadly I'm hitting the error below:
OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB (GPU 0; 22.03 GiB total capacity; 21.65 GiB
already allocated; 94.88 MiB free; 21.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated
memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and
PYTORCH_CUDA_ALLOC_CONF
Here's the output of !nvidia-smi (before running anything):
Thu Apr 20 18:04:00 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G Off| 00000000:00:1E.0 Off | 0 |
| 0% 23C P0 52W / 300W| 0MiB / 23028MiB | 18% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Any idea how to modify my code to make it work?
The max_length is too long; 9999 tokens will consume a huge amount of GPU RAM, especially with a 13B model. Try a 7B model instead, and try something like peft/bitsandbytes to reduce GPU RAM usage. Setting load_in_8bit=True is a good start.
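Here is a minimal sketch of how the model-loading part could look with those changes. It assumes a 7B Vicuna checkpoint (eachadea/vicuna-7b-1.1 is used here only as an example; swap in whichever 7B model you actually want) and that bitsandbytes and accelerate are installed, and it caps generation with max_new_tokens instead of a 9999-token max_length:

    import torch
    from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
    from langchain.llms.base import LLM

    class CustomLLM(LLM):
        # Example 7B checkpoint; adjust to the model you actually use.
        model_name = "eachadea/vicuna-7b-1.1"
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        # load_in_8bit quantizes the weights via bitsandbytes, so the model
        # takes roughly 1 byte per parameter on the GPU instead of 2 (fp16)
        # or 4 (fp32). device_map="auto" lets accelerate place the layers.
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            load_in_8bit=True,
            device_map="auto",
            torch_dtype=torch.float16,
        )
        # Vicuna is a decoder-only (causal) model, so use "text-generation";
        # no device argument here, since device_map already placed the model.
        pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

        def _call(self, prompt, stop=None):
            # Bound the number of newly generated tokens instead of asking
            # for a 9999-token max_length.
            return self.pipeline(prompt, max_new_tokens=256)[0]["generated_text"]

        @property
        def _identifying_params(self):
            return {"name_of_model": self.model_name}

        @property
        def _llm_type(self):
            return "custom"

With 8-bit weights, a 7B model needs roughly 7 GB for the parameters, which should fit comfortably on the 23 GB A10G shown in your nvidia-smi output, leaving headroom for activations and generation.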