I have code like this, and when I launch it I get an ngrok link.
!pip install aiohttp pyngrok
import os
import asyncio
from aiohttp import ClientSession
# Set LD_LIBRARY_PATH so the system NVIDIA library becomes preferred
# over the built-in library. This is particularly important for
# Google Colab which installs older drivers
os.environ.update({'LD_LIBRARY_PATH': '/usr/lib64-nvidia'})
async def run(cmd):
    '''
    run is a helper function to run subcommands asynchronously.
    '''
    print('>>> starting', *cmd)
    p = await asyncio.subprocess.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )

    async def pipe(lines):
        async for line in lines:
            print(line.strip().decode('utf-8'))

    await asyncio.gather(
        pipe(p.stdout),
        pipe(p.stderr),
    )

await asyncio.gather(
    run(['ollama', 'serve']),
    run(['ngrok', 'http', '--log', 'stderr', '11434']),
)
Which I'm following, but the following appears on the page:
How can I fix this? Before that, I did the following:
!choco install ngrok
!ngrok config add-authtoken -----
!curl https://ollama.ai/install.sh | sh
!command -v systemctl >/dev/null && sudo systemctl stop ollama
!curl https://ollama.ai/install.sh | sh
# should produce, among other things:
# The Ollama API is now available at 0.0.0.0:11434
This means Ollama is running (but do check for errors, especially around graphics capability/CUDA, as these may interfere).
However, don't run
!command -v systemctl >/dev/null && sudo systemctl stop ollama
(unless you want to stop Ollama).
The next step is to start the Ollama service. Since you are using ngrok, I'm assuming you want to be able to run the LLM from environments outside the Colab? If that isn't the case, you don't really need ngrok. But since Colabs are tricky to get working nicely with async code and threads, it's useful to use the Colab to, e.g., run a VM powerful enough to play with larger models than anything you could run on your dev environment (if that is an issue).
Ollama isn't yet running as a service but we can set up ngrok in advance of this:
import os
import time
import queue
import asyncio
import threading
from pyngrok import ngrok
# Get your ngrok token from your ngrok account:
# https://dashboard.ngrok.com/get-started/your-authtoken
token="your token goes here - don't forget to replace this with it!"
ngrok.set_auth_token(token)
# set up a stoppable thread (not mandatory, but cleaner if you want to stop this later)
class StoppableThread(threading.Thread):
    def __init__(self, *args, **kwargs):
        super(StoppableThread, self).__init__(*args, **kwargs)
        self._stop_event = threading.Event()

    def stop(self):
        self._stop_event.set()

    def is_stopped(self):
        return self._stop_event.is_set()
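If you want to convince yourself the stop mechanism works before wiring it to ngrok, here's a self-contained sketch of the same class with a trivial worker (the worker function and timings are illustrative, not part of the setup above):

```python
import threading
import time

class StoppableThread(threading.Thread):
    '''Thread with a stop() method; the target should poll is_stopped().'''
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._stop_event = threading.Event()

    def stop(self):
        self._stop_event.set()

    def is_stopped(self):
        return self._stop_event.is_set()

def worker():
    # threading.current_thread() inside the target is the StoppableThread itself
    while not threading.current_thread().is_stopped():
        time.sleep(0.01)

t = StoppableThread(target=worker)
t.start()
t.stop()        # signal the loop to exit
t.join(timeout=2)
print(t.is_alive())
```

Once `stop()` is called the worker's loop condition fails on its next check, the thread finishes, and `is_alive()` reports False.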
def start_ngrok(q):
    try:
        # Start an HTTP tunnel on the specified port
        public_url = ngrok.connect(11434)
        # Put the public URL in the queue
        q.put(public_url)
        # Keep the thread alive until this thread's stop event is set
        while not threading.current_thread().is_stopped():
            time.sleep(1)  # Adjust sleep time as needed
    except Exception as e:
        print(f"Error in start_ngrok: {e}")
Run that code so the functions exist. Then, in the next cell, start ngrok in a separate thread so it doesn't hang your Colab. We'll use a queue to share data between threads, because we want to know the ngrok public URL once it's running:
# Create a queue to share data between threads
url_queue = queue.Queue()
# Start ngrok in a separate thread; inside start_ngrok,
# threading.current_thread() is the StoppableThread instance
ngrok_thread = StoppableThread(target=start_ngrok, args=(url_queue,))
ngrok_thread.start()
That will be running, but you need to get the results from the queue to see what ngrok returned, so then do:
# Wait for the ngrok tunnel to be established
while True:
    try:
        public_url = url_queue.get(timeout=1)  # block for up to a second
        break
    except queue.Empty:
        print("Waiting for ngrok URL...")
print("Ngrok tunnel established at:", public_url)
This should output something like:
Ngrok tunnel established at: NgrokTunnel: "https://{somelongsubdomain}.ngrok-free.app" -> "http://localhost:11434"
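Note that pyngrok's ngrok.connect() returns an NgrokTunnel object, not a plain string; if you need just the URL you can use its public_url attribute. Alternatively, you can pull it out of the printed form. A small sketch using the repr format shown above (the sample string is illustrative):

```python
import re

# The printed form of a pyngrok NgrokTunnel, as shown above
s = 'NgrokTunnel: "https://example123.ngrok-free.app" -> "http://localhost:11434"'

# Extract the public https URL from that string
match = re.search(r'https://[^"]+', s)
url = match.group(0) if match else None
print(url)
```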
import os
import asyncio
# NB: You may need to set these to get CUDA working, depending on which backend you are running.
# Set environment variables for the NVIDIA libraries and CUDA
os.environ['PATH'] += ':/usr/local/cuda/bin'
# Set LD_LIBRARY_PATH to include both /usr/lib64-nvidia and the CUDA lib directories
os.environ['LD_LIBRARY_PATH'] = '/usr/lib64-nvidia:/usr/local/cuda/lib64'

async def run_process(cmd):
    print('>>> starting', *cmd)
    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )

    # define an async pipe function
    async def pipe(lines):
        async for line in lines:
            print(line.decode().strip())

    # drain stdout and stderr concurrently (once only - gathering a
    # second time would iterate already-exhausted streams)
    await asyncio.gather(
        pipe(process.stdout),
        pipe(process.stderr),
    )
That creates the function to run an async command but doesn't run it yet.
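The same pattern can be sanity-checked outside a notebook with a harmless command (echo stands in for ollama serve here); note that in a plain script there is no running event loop, so asyncio.run replaces the bare await:

```python
import asyncio

async def run_process(cmd):
    # Launch the command with piped stdout/stderr
    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    captured = []

    async def pipe(lines):
        async for line in lines:
            text = line.decode().strip()
            captured.append(text)
            print(text)

    # Drain both streams concurrently
    await asyncio.gather(pipe(process.stdout), pipe(process.stderr))
    return captured

output = asyncio.run(run_process(['echo', 'hello from the subprocess']))
```

The lines the subprocess writes are echoed as they arrive, which is exactly what you'll see later when ollama serve logs to these pipes.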
This will start ollama in a separate thread so your Colab isn't blocked:
import asyncio
import threading

async def start_ollama_serve():
    await run_process(['ollama', 'serve'])

def run_async_in_thread(loop, coro):
    asyncio.set_event_loop(loop)
    loop.run_until_complete(coro)
    loop.close()

# Create a new event loop that will run in a new thread
new_loop = asyncio.new_event_loop()
# Start ollama serve in a separate thread so the cell won't block execution
thread = threading.Thread(target=run_async_in_thread, args=(new_loop, start_ollama_serve()))
thread.start()
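The thread-plus-event-loop pattern itself can be checked with a trivial coroutine in place of ollama serve (the join here is only so the demo waits for the result; in the Colab you deliberately don't join, so the cell returns immediately):

```python
import asyncio
import threading

results = []

async def demo_task():
    # stands in for the long-running ollama serve coroutine
    await asyncio.sleep(0.01)
    results.append('ran in background loop')

def run_async_in_thread(loop, coro):
    asyncio.set_event_loop(loop)
    loop.run_until_complete(coro)
    loop.close()

new_loop = asyncio.new_event_loop()
thread = threading.Thread(target=run_async_in_thread, args=(new_loop, demo_task()))
thread.start()
thread.join()  # only for the demo; skip this in the Colab so the cell doesn't block
print(results[0])
```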
It should produce something like:
>>> starting ollama serve
Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is:
ssh-ed25519 {some key}
2024/01/16 20:19:11 images.go:808: total blobs: 0
2024/01/16 20:19:11 images.go:815: total unused blobs removed: 0
2024/01/16 20:19:11 routes.go:930: Listening on 127.0.0.1:11434 (version 0.1.20)
Now you're all set up. You can do the next steps in the Colab, but it might be easier to run them on your local machine if you normally dev there.
The following assumes you have installed Ollama on your local dev environment (say WSL2; I'm assuming it's Linux anyway), i.e. the laptop or desktop machine in front of you (as opposed to the Colab).
Replace the actual URI below with whatever public URI ngrok reported above:
export OLLAMA_HOST=https://{longcode}.ngrok-free.app/
You can now run ollama and it will run on the remote in your Colab (so long as that stays up and running).
E.g. run this on your local machine and it will look as if it's running locally, but it's really running in your Colab, and the results are served to wherever you call it from (so long as OLLAMA_HOST is set correctly and is a valid tunnel to your ollama service):
ollama run mistral
You can now interact with the model on the command line locally but the model runs on the Colab.
If you want to run larger models, like mixtral, then you need to be sure to connect your Colab to backend compute that's powerful enough (e.g. 48GB+ of RAM; a V100 GPU is the minimum spec for this at the time of writing).
Note: If you have any issues with CUDA or NVIDIA showing in the outputs of any steps above, don't proceed until you fix them.