I have code like this, and when I launch it I get an ngrok link.
!pip install aiohttp pyngrok
import os
import asyncio
from aiohttp import ClientSession
# Set LD_LIBRARY_PATH so the system NVIDIA library becomes preferred
# over the built-in library. This is particularly important for
# Google Colab which installs older drivers
os.environ.update({'LD_LIBRARY_PATH': '/usr/lib64-nvidia'})
async def run(cmd):
    '''
    run is a helper function to run subcommands asynchronously.
    '''
    print('>>> starting', *cmd)
    p = await asyncio.subprocess.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )

    async def pipe(lines):
        async for line in lines:
            print(line.strip().decode('utf-8'))

    await asyncio.gather(
        pipe(p.stdout),
        pipe(p.stderr),
    )

await asyncio.gather(
    run(['ollama', 'serve']),
    run(['ngrok', 'http', '--log', 'stderr', '11434']),
)
Which I'm following, but the following appears on the page:
How can I fix this? Before that, I did the following:
!choco install ngrok
!ngrok config add-authtoken -----
!curl https://ollama.ai/install.sh | sh
!command -v systemctl >/dev/null && sudo systemctl stop ollama
!curl https://ollama.ai/install.sh | sh
# should produce, among other things:
# The Ollama API is now available at 0.0.0.0:11434
This means Ollama is running (but do check for errors, especially around graphics capability/CUDA, as these may interfere).
However, don't run
!command -v systemctl >/dev/null && sudo systemctl stop ollama
(unless you want to stop Ollama).
The next step is to start the Ollama service. Since you are using ngrok, I'm assuming you want to be able to run the LLM from environments outside the Colab? If that isn't the case, you don't really need ngrok. But since Colabs are tricky to get working nicely with async code and threads, it's useful to use the Colab to, e.g., run a VM powerful enough to play with larger models than anything you could run on your dev environment (if that is an issue).
Ollama isn't yet running as a service but we can set up ngrok in advance of this:
import os
import time
import queue
import asyncio
import threading
from pyngrok import ngrok
# Get your ngrok token from your ngrok account:
# https://dashboard.ngrok.com/get-started/your-authtoken
token="your token goes here - don't forget to replace this with it!"
ngrok.set_auth_token(token)
# set up a stoppable thread (not mandatory, but cleaner if you want to stop this later)
class StoppableThread(threading.Thread):
    def __init__(self, *args, **kwargs):
        super(StoppableThread, self).__init__(*args, **kwargs)
        self._stop_event = threading.Event()

    def stop(self):
        self._stop_event.set()

    def is_stopped(self):
        return self._stop_event.is_set()
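If you want to convince yourself the stop mechanism works before wiring it to ngrok, here's a self-contained sketch of the same class with a trivial worker (the worker function and timings are illustrative, not part of the setup above):

```python
import threading
import time

class StoppableThread(threading.Thread):
    '''Thread with a stop() method; the target should poll is_stopped().'''
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._stop_event = threading.Event()

    def stop(self):
        self._stop_event.set()

    def is_stopped(self):
        return self._stop_event.is_set()

def worker():
    # threading.current_thread() inside the target is the StoppableThread itself
    while not threading.current_thread().is_stopped():
        time.sleep(0.01)

t = StoppableThread(target=worker)
t.start()
t.stop()        # signal the loop to exit
t.join(timeout=2)
print(t.is_alive())
```

Once `stop()` is called the worker's loop condition fails on its next check, the thread finishes, and `is_alive()` reports False.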
def start_ngrok(q):
    try:
        # Start an HTTP tunnel on the specified port
        public_url = ngrok.connect(11434)
        # Put the public URL in the queue
        q.put(public_url)
        # Keep the thread alive until this thread's stop event is set
        while not threading.current_thread().is_stopped():
            time.sleep(1)  # Adjust sleep time as needed
    except Exception as e:
        print(f"Error in start_ngrok: {e}")
Run that code so the functions exist. Then, in the next cell, start ngrok in a separate thread so it doesn't hang your Colab. We'll use a queue to share data between threads, because we want to know the ngrok public URL once it's running:
# Create a queue to share data between threads
url_queue = queue.Queue()
# Start ngrok in a separate thread; inside start_ngrok,
# threading.current_thread() is the StoppableThread instance
ngrok_thread = StoppableThread(target=start_ngrok, args=(url_queue,))
ngrok_thread.start()
That will be running, but you need to get the results from the queue to see what ngrok returned, so then do:
# Wait for the ngrok tunnel to be established
while True:
    try:
        public_url = url_queue.get(timeout=1)  # block for up to a second
        break
    except queue.Empty:
        print("Waiting for ngrok URL...")
print("Ngrok tunnel established at:", public_url)
This should output something like:
Ngrok tunnel established at: NgrokTunnel: "https://{somelongsubdomain}.ngrok-free.app" -> "http://localhost:11434"
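Note that pyngrok's ngrok.connect() returns an NgrokTunnel object, not a plain string; if you need just the URL you can use its public_url attribute. Alternatively, you can pull it out of the printed form. A small sketch using the repr format shown above (the sample string is illustrative):

```python
import re

# The printed form of a pyngrok NgrokTunnel, as shown above
s = 'NgrokTunnel: "https://example123.ngrok-free.app" -> "http://localhost:11434"'

# Extract the public https URL from that string
match = re.search(r'https://[^"]+', s)
url = match.group(0) if match else None
print(url)
```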
import os
import asyncio
# NB: You may need to set these to get CUDA working, depending on which backend you are running.
# Set environment variables for the NVIDIA libraries and CUDA
os.environ['PATH'] += ':/usr/local/cuda/bin'
# Set LD_LIBRARY_PATH to include both /usr/lib64-nvidia and the CUDA lib directories
os.environ['LD_LIBRARY_PATH'] = '/usr/lib64-nvidia:/usr/local/cuda/lib64'

async def run_process(cmd):
    print('>>> starting', *cmd)
    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )

    # define an async pipe function
    async def pipe(lines):
        async for line in lines:
            print(line.decode().strip())

    # drain stdout and stderr concurrently (once only - gathering a
    # second time would iterate already-exhausted streams)
    await asyncio.gather(
        pipe(process.stdout),
        pipe(process.stderr),
    )
That creates the function to run an async command but doesn't run it yet.
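The same pattern can be sanity-checked outside a notebook with a harmless command (echo stands in for ollama serve here); note that in a plain script there is no running event loop, so asyncio.run replaces the bare await:

```python
import asyncio

async def run_process(cmd):
    # Launch the command with piped stdout/stderr
    process = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    captured = []

    async def pipe(lines):
        async for line in lines:
            text = line.decode().strip()
            captured.append(text)
            print(text)

    # Drain both streams concurrently
    await asyncio.gather(pipe(process.stdout), pipe(process.stderr))
    return captured

output = asyncio.run(run_process(['echo', 'hello from the subprocess']))
```

The lines the subprocess writes are echoed as they arrive, which is exactly what you'll see later when ollama serve logs to these pipes.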
This will start ollama in a separate thread so your Colab isn't blocked:
import asyncio
import threading

async def start_ollama_serve():
    await run_process(['ollama', 'serve'])

def run_async_in_thread(loop, coro):
    asyncio.set_event_loop(loop)
    loop.run_until_complete(coro)
    loop.close()

# Create a new event loop that will run in a new thread
new_loop = asyncio.new_event_loop()
# Start ollama serve in a separate thread so the cell won't block execution
thread = threading.Thread(target=run_async_in_thread, args=(new_loop, start_ollama_serve()))
thread.start()
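The thread-plus-event-loop pattern itself can be checked with a trivial coroutine in place of ollama serve (the join here is only so the demo waits for the result; in the Colab you deliberately don't join, so the cell returns immediately):

```python
import asyncio
import threading

results = []

async def demo_task():
    # stands in for the long-running ollama serve coroutine
    await asyncio.sleep(0.01)
    results.append('ran in background loop')

def run_async_in_thread(loop, coro):
    asyncio.set_event_loop(loop)
    loop.run_until_complete(coro)
    loop.close()

new_loop = asyncio.new_event_loop()
thread = threading.Thread(target=run_async_in_thread, args=(new_loop, demo_task()))
thread.start()
thread.join()  # only for the demo; skip this in the Colab so the cell doesn't block
print(results[0])
```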
It should produce something like:
>>> starting ollama serve
Couldn't find '/root/.ollama/id_ed25519'. Generating new private key.
Your new public key is:
ssh-ed25519 {some key}
2024/01/16 20:19:11 images.go:808: total blobs: 0
2024/01/16 20:19:11 images.go:815: total unused blobs removed: 0
2024/01/16 20:19:11 routes.go:930: Listening on 127.0.0.1:11434 (version 0.1.20)
Now you're all set up. You can do the next steps in the Colab, but it might be easier to run them on your local machine if you normally dev there.
The following assumes you have installed Ollama on your local dev environment (say WSL2; I'm assuming it's Linux anyway), i.e. the laptop or desktop machine in front of you (as opposed to the Colab).
Replace the actual URI below with whatever public URI ngrok reported above:
export OLLAMA_HOST=https://{longcode}.ngrok-free.app/
You can now run ollama and it will run on the remote in your Colab (so long as that stays up and running).
E.g. run this on your local machine and it will look as if it's running locally, but it's really running in your Colab, and the results are served to wherever you call it from (so long as OLLAMA_HOST is set correctly and is a valid tunnel to your ollama service):
ollama run mistral
You can now interact with the model on the command line locally but the model runs on the Colab.
If you want to run larger models, like mixtral, then you need to be sure to connect your Colab to backend compute that's powerful enough (e.g. 48GB+ of RAM; a V100 GPU is the minimum spec for this at the time of writing).
Note: If you have any issues with CUDA or NVIDIA showing in the outputs of any steps above, don't proceed until you fix them.