Tags: python, docker, mistral-7b, ollama

Is there parallelism inside Ollama?


The Python program below is intended to translate large English texts into French. I use a for loop to feed a series of reports to Ollama.

from functools import cached_property

from ollama import Client


class TestOllama:

    @cached_property
    def ollama_client(self) -> Client:
        # Reuse a single client for all requests to the local Ollama server.
        return Client(host="http://127.0.0.1:11434")

    def translate(self, text_to_translate: str):
        ollama_response = self.ollama_client.generate(
            model="mistral",
            prompt=f"translate this French text into English: {text_to_translate}"
        )
        return ollama_response['response'].lstrip(), ollama_response['total_duration']

    def run(self):
        reports = ["reports_text_1", "reports_text_2"]  # ... average text size per report is between 750 and 1000 tokens.
        for each_report in reports:
            try:
                translated_report, total_duration = self.translate(
                    text_to_translate=each_report
                )
                print(f"Translated text:{translated_report}, Time taken:{total_duration}")
            except Exception as e:
                # Don't fail the whole batch on one bad report, but at least log the error.
                print(f"Translation failed: {e}")


if __name__ == '__main__':
    job = TestOllama()
    job.run()

Docker command used to run Ollama:

docker run -d --gpus=all --network=host --security-opt seccomp=unconfined -v report_translation_ollama:/root/.ollama --name ollama ollama/ollama

My question: when I run this script on a V100 and an H100, I don't see a significant difference in execution time. I avoided parallelism in my own code, thinking that Ollama might use parallelism internally to process requests. However, when I check with htop, I see only one core being used. Is my understanding correct?

I am a beginner in NLP, so any help or guidance on how to organize my code (e.g., using multithreading to send Ollama requests) would be appreciated.


Solution

  • Experimental flags OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS were added in v0.1.33. You can set them when starting the Ollama server (a Docker equivalent is sketched after the settings list below):

    OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve
    

    Available server settings

    • OLLAMA_MAX_LOADED_MODELS - The maximum number of models that can be loaded concurrently provided they fit in available memory. The default is 3 * the number of GPUs or 3 for CPU inference.
    • OLLAMA_NUM_PARALLEL - The maximum number of parallel requests each model will process at the same time. The default will auto-select either 4 or 1 based on available memory.
    • OLLAMA_MAX_QUEUE - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512.

    Source: faq.md#how-does-ollama-handle-concurrent-requests
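
  • Since Ollama is started here through Docker rather than with ollama serve, the same settings can be passed as environment variables with docker run -e. A minimal sketch reusing the run command from the question, with the same illustrative values as the ollama serve example above:

    docker run -d --gpus=all --network=host --security-opt seccomp=unconfined \
      -e OLLAMA_NUM_PARALLEL=4 -e OLLAMA_MAX_LOADED_MODELS=4 \
      -v report_translation_ollama:/root/.ollama --name ollama ollama/ollama

  • Enabling parallel requests on the server only helps if the client sends more than one request at a time; the for loop in the question keeps exactly one prompt in flight. Below is a minimal sketch that wraps the same Client.generate call in a thread pool. max_workers=4 is an assumption chosen to match OLLAMA_NUM_PARALLEL=4, and creating one Client per call is a simplification rather than a requirement of the library:

    from concurrent.futures import ThreadPoolExecutor, as_completed

    from ollama import Client


    def translate(text_to_translate: str):
        # One Client per call keeps the sketch self-contained; a shared client
        # may also work, but that is not assumed here.
        client = Client(host="http://127.0.0.1:11434")
        response = client.generate(
            model="mistral",
            prompt=f"translate this French text into English: {text_to_translate}",
        )
        return response["response"].lstrip(), response["total_duration"]


    # Illustrative inputs; substitute the real reports.
    reports = ["reports_text_1", "reports_text_2"]

    # Match max_workers to OLLAMA_NUM_PARALLEL so requests are processed in
    # parallel on the server instead of only waiting in its queue.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(translate, report): report for report in reports}
        for future in as_completed(futures):
            try:
                translated_report, total_duration = future.result()
                print(f"Translated text:{translated_report}, Time taken:{total_duration}")
            except Exception as e:
                print(f"Translation failed for {futures[future]!r}: {e}")

    Requests beyond OLLAMA_NUM_PARALLEL are queued by the server (up to OLLAMA_MAX_QUEUE), so raising max_workers on its own does not add throughput.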