Tags: godot, godot4, ollama

Slow Ollama API - how to make sure the GPU is used


I made a simple demo of a chatbox interface in Godot that lets you chat with a language model running under Ollama. Currently, the interface between Godot and the language model uses the Ollama API, and the response time is about 30 seconds.

If I chat with the model directly through the Ollama CLI, the response time is much lower (under a second), and it is noticeably lower even when I call the API with curl:

    curl http://localhost:11434/api/generate -d '{ "model": "qwen2:1.5b", "prompt": "What is water made of?", "stream": false}'

Here is the code snippet I am using to interact with Ollama:

func send_to_ollama(message):
    var url = "http://localhost:11434/api/generate"
    var headers = ["Content-Type: application/json"]
    var body = JSON.stringify({
        "model": "qwen2:1.5b",
        "prompt": message,
        "stream": false
    })
    # Assumption: an HTTPRequest child node named HTTPRequest exists in the scene
    $HTTPRequest.request(url, headers, HTTPClient.METHOD_POST, body)
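
For comparison, the same non-streaming request can be timed outside both Godot and curl with a small Python sketch (assuming the requests package is installed; the script name and timing code are illustrative, not part of the original setup):

    import time

    import requests  # assumed installed: pip install requests

    start = time.time()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "qwen2:1.5b", "prompt": "What is water made of?", "stream": False},
    )
    print(resp.json()["response"])
    print(f"elapsed: {time.time() - start:.1f}s")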

Do you spot anything wrong? Am I calling the API correctly? Do I need to tell Ollama somehow that I want it to use the GPU?


Solution

  • It is NOT slow; it only appears to be slow.

    The CLI starts printing output word by word immediately after you hit Enter. In contrast, 'langchain' collects the entire output first, which takes 15-20 seconds depending on the length of the response, and only then prints everything at once. Even subprocess.run() has the same effect. (A sketch that streams from the HTTP API directly follows at the end of this answer.)

    Workaround:

        import os
        os.system('ollama run llama3.2:1b what is water short answer')

    Then run the Python script from the terminal: python main.py

    Here, you can see output almost immediately as a stream.

    You can also save the output to a text file that your Python script can then read:

        os.system('ollama run llama3.2:1b what is water short answer > output.txt')

    To append to the text file instead:

        os.system('ollama run llama3.2:1b what is water short answer >> output.txt')

    I have posted this answer on GitHub as well.
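
    For completeness, the Ollama HTTP API can also stream on its own, without shelling out to the CLI: with "stream": true, /api/generate returns one JSON object per line as tokens are generated. A minimal sketch, assuming the requests package is installed:

        import json

        import requests  # assumed installed: pip install requests

        with requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3.2:1b", "prompt": "what is water short answer", "stream": True},
            stream=True,
        ) as resp:
            for line in resp.iter_lines():
                if line:
                    chunk = json.loads(line)
                    # each chunk carries the next piece of the response text
                    print(chunk.get("response", ""), end="", flush=True)
        print()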