I have a cluster that is not connected to the internet, although it does have a sort of weights repository available. I need to run LLM inference on it.

The only option I have found so far is using a combination of the `transformers` and `langchain` modules, but I don't want to tweak the models' hyperparameters. I ran into the `ollama` software, but I cannot install anything on the cluster except Python libs. So, naturally, I wonder what my options for running LLM inference are. And there are some more questions:
1. Can I use only the `ollama-python` package, without installing their Linux software? Or do I need both to run my inference?
2. If I manage to run `ollama` on this cluster, how can I provide pretrained weights to the model? If it helps, they are stored in (sometimes multiple) `.bin` files.

You don't really have to install `ollama`. Instead, you can directly run the LLM, for example the `mistral` model, locally:
```python
from langchain_community.llms import GPT4All
from langchain_core.callbacks import StreamingStdOutCallbackHandler

callbacks = [StreamingStdOutCallbackHandler()]  # stream generated tokens to stdout
llm = GPT4All(
    model="/home/jeff/.cache/huggingface/hub/gpt4all/mistral-7b-openorca.Q4_0.gguf",
    device="gpu", n_threads=8,
    callbacks=callbacks, verbose=True)
```
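A quick way to check that the model loads, assuming the `.gguf` file really is at that path:

```python
# ask a one-line question; tokens stream to stdout via the callback handler
print(llm.invoke("Name three planets of the Solar System."))
```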
Or, for the `falcon` model:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
import torch

model_id = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline(  # renamed so it does not shadow the `pipeline` factory
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    # trust_remote_code=True,
    device_map="auto",
    max_new_tokens=100,
    # max_length=200,
)
llm = HuggingFacePipeline(pipeline=pipe)
```
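The wrapped pipeline behaves like any other LangChain LLM, so you can call it directly or compose it with a prompt; a minimal sketch:

```python
from langchain_core.prompts import PromptTemplate

# build a tiny chain: prompt template piped into the local falcon pipeline
prompt = PromptTemplate.from_template("Answer in one sentence: {question}")
chain = prompt | llm
print(chain.invoke({"question": "What is a falcon?"}))
```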
I have an Nvidia 4090 with 16 GB of RAM in my laptop, which can support running both of the above models locally.
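Regarding the `.bin` files: `transformers` can load a model straight from a local directory with no network access, so if your weights repository contains a standard Hugging Face model folder (a `config.json`, tokenizer files, and one or more `.bin` shards), you can pass its path to `from_pretrained`. A minimal sketch, with a hypothetical directory `/weights/falcon-7b-instruct`:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

local_dir = "/weights/falcon-7b-instruct"  # hypothetical: holds config.json, tokenizer files, *.bin shards
tokenizer = AutoTokenizer.from_pretrained(local_dir, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    local_dir,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    local_files_only=True,  # fail fast instead of trying to reach the Hub
)
```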