Tags: godot, godot4, ollama

Slow Ollama API - how to make sure the GPU is used


I made a simple demo of a chatbox interface in Godot that lets you chat with a language model running under Ollama. Currently, the interface between Godot and the language model uses the Ollama API, and the response time is about 30 seconds.

If I chat with the model directly through the Ollama CLI, the response time is much lower (under a second), and it is noticeably lower even when I call the API with curl:

    curl http://localhost:11434/api/generate -d '{ "model": "qwen2:1.5b", "prompt": "What is water made of?", "stream": false}'

Here is the code snippet I am using to interact with Ollama:

func send_to_ollama(message):
    var url = "http://localhost:11434/api/generate"
    var headers = ["Content-Type: application/json"]
    var body = JSON.stringify({
        "model": "qwen2:1.5b",
        "prompt": message,
        "stream": false
    })
    # Assumption: an HTTPRequest child node named HTTPRequest exists in the scene
    $HTTPRequest.request(url, headers, HTTPClient.METHOD_POST, body)
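
For comparison, the same non-streaming request can be timed outside both Godot and curl with a small Python sketch (assuming the requests package is installed; the script name and timing code are illustrative, not part of the original setup):

    import time

    import requests  # assumed installed: pip install requests

    start = time.time()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "qwen2:1.5b", "prompt": "What is water made of?", "stream": False},
    )
    print(resp.json()["response"])
    print(f"elapsed: {time.time() - start:.1f}s")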

Do you spot anything wrong? Am I calling the API correctly? Do I need to tell Ollama somehow that I want it to use the GPU?


Solution

  • It is NOT slow; it only appears to be slow.

    The CLI starts printing output word by word immediately after you hit Enter. In contrast, 'langchain' collects the entire output first, which takes 15-20 seconds depending on the length of the response, and only then prints everything at once. Even subprocess.run() has the same effect. (A sketch that streams from the HTTP API directly follows at the end of this answer.)

    Workaround:

        import os
        os.system('ollama run llama3.2:1b what is water short answer')

    Then run the Python script from the terminal: python main.py

    Here, you can see output almost immediately as a stream.

    You can also save the output to a text file that your Python script can then read:

        os.system('ollama run llama3.2:1b what is water short answer > output.txt')

    To append to the text file instead:

        os.system('ollama run llama3.2:1b what is water short answer >> output.txt')

    I have posted this answer on GitHub as well.
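
    For completeness, the Ollama HTTP API can also stream on its own, without shelling out to the CLI: with "stream": true, /api/generate returns one JSON object per line as tokens are generated. A minimal sketch, assuming the requests package is installed:

        import json

        import requests  # assumed installed: pip install requests

        with requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3.2:1b", "prompt": "what is water short answer", "stream": True},
            stream=True,
        ) as resp:
            for line in resp.iter_lines():
                if line:
                    chunk = json.loads(line)
                    # each chunk carries the next piece of the response text
                    print(chunk.get("response", ""), end="", flush=True)
        print()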