Ollama on AWS User Guide

Product: Ollama on AWS

Overview

This image runs Ollama 0.30 on Ubuntu 24.04 LTS with NVIDIA GPU acceleration. Ollama pulls and serves open large language models such as Llama, Mistral, Gemma, Phi, Qwen and DeepSeek behind a REST API that is also compatible with the OpenAI chat-completions API. The NVIDIA datacenter driver is preinstalled and verified on real hardware during the build, and Ollama auto-detects the GPU and offloads model inference to it.

The server listens on the loopback address 127.0.0.1:11434 under an unprivileged ollama system account; nginx fronts it on port 80 with HTTP Basic Authentication. The public /api/version endpoint stays open for health checks; everything else (model pull, generate, chat and the /v1/* OpenAI-compatible endpoints) requires the password. The default security group opens port 22 (SSH) and port 80 (HTTP) only, so 11434 is not reachable externally.
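The access rules above can be exercised from Python. The following is a minimal sketch using only the standard library; the helper name build_request is hypothetical, and the base URL and password are placeholders to replace with your own values.

```python
import base64
import json
import urllib.request


def build_request(base_url, path, username=None, password=None, payload=None):
    """Build a urllib Request for the nginx front end (hypothetical helper).

    Credentials are optional: /api/version is open, other paths need Basic auth.
    """
    headers = {}
    data = None
    if username is not None and password is not None:
        token = base64.b64encode(f"{username}:{password}".encode()).decode()
        headers["Authorization"] = f"Basic {token}"
    if payload is not None:
        data = json.dumps(payload).encode()
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(base_url.rstrip("/") + path, data=data, headers=headers)


# Health check: no credentials required.
health = build_request("http://127.0.0.1", "/api/version")

# Any other endpoint: send the admin user and the generated password.
tags = build_request("http://127.0.0.1", "/api/tags", "admin", "<your-unique-password>")

# To send a request on the instance: urllib.request.urlopen(health, timeout=10).read()
```

Building the request separately from sending it keeps the authentication logic easy to inspect before any network traffic happens.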

On the first boot of every deployed instance a one-shot service generates a fresh password, unique to that instance, and writes it to /root/ollama-credentials.txt (mode 0600, root only). Model weights live under /var/lib/ollama/models on a dedicated, independently resizable EBS data volume. A small starter model (llama3.2:1b) is pre-pulled so the API responds immediately; you pull additional models with ollama pull.

Prerequisites

  • An AWS account subscribed to this product in AWS Marketplace.
  • An EC2 key pair in your target region for SSH access.
  • A security group allowing inbound TCP 22 (SSH) from your IP and TCP 80 (HTTP) from your users.
  • An NVIDIA GPU instance type. Recommended: g4dn.xlarge (NVIDIA T4, good value for 1B-8B models). For larger models or higher throughput use g5/g6 (A10G / L4) or multi-GPU g5.12xlarge.

Connecting to your instance

OS variant      Login user   Example
Ubuntu 24.04    ubuntu       ssh -i your-key.pem ubuntu@<instance-public-ip>

Step 1 - Launch from the AWS Marketplace console

  1. Open the product page in AWS Marketplace and choose Continue to Subscribe, then Continue to Configuration.
  2. Select the Ollama 0.30 on Ubuntu 24.04 delivery option and your region, then Continue to Launch.
  3. Choose a GPU instance type (g4dn.xlarge or larger), your VPC/subnet, key pair and the security group described above, and launch.

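Step 2 - Launch from the AWS CLI

As an alternative to the console launch in Step 1, you can launch the instance from the AWS CLI. Replace the AMI ID, key pair name and security group ID placeholders with your own values.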

aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type g4dn.xlarge \
  --key-name your-key \
  --security-group-ids sg-xxxxxxxx \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ollama}]'
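
The same launch can be scripted from Python. This is a sketch that assumes boto3 is installed and AWS credentials are configured; the function names run_instances_params and launch are hypothetical, and the IDs are the same placeholders as in the CLI command.

```python
def run_instances_params(ami_id, key_name, security_group_id, instance_type="g4dn.xlarge"):
    """Keyword arguments for boto3's ec2.run_instances, mirroring the CLI call."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "SecurityGroupIds": [security_group_id],
        "MinCount": 1,  # boto3 requires both counts; the CLI defaults to one instance
        "MaxCount": 1,
        "TagSpecifications": [
            {"ResourceType": "instance", "Tags": [{"Key": "Name", "Value": "ollama"}]}
        ],
    }


def launch(params, region):
    """Launch the instance and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the parameter builder works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.run_instances(**params)["Instances"][0]["InstanceId"]


params = run_instances_params("ami-xxxxxxxxxxxxxxxxx", "your-key", "sg-xxxxxxxx")
print(params["InstanceType"])
```

Calling launch(params, "<your-region>") would start the instance; only the parameter dictionary is built here.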

Step 3 - Connect to your instance

ssh -i your-key.pem ubuntu@<instance-public-ip>

Step 4 - Confirm the GPU and services

nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
systemctl is-active ollama.service nginx.service
curl -s http://127.0.0.1/api/version

Expected output:

Tesla T4, 610.43.02, 15360 MiB
active
active
{"version":"0.30.7"}

[Screenshot: Ollama running on the cloudimg GPU AMI, with the Tesla T4 detected, a model loaded on the GPU, services active and the API gated by basic auth]
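
If you want to script this check, the nvidia-smi line in the expected output can be parsed as plain CSV. The sketch below assumes the same three query fields as the command above; the function names are hypothetical.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_line(line):
    """Split one CSV line from nvidia-smi into name, driver and memory in MiB."""
    name, driver, memory = [field.strip() for field in line.split(",")]
    return {"name": name, "driver": driver, "memory_mib": int(memory.split()[0])}


def query_gpus():
    """Run nvidia-smi on the instance and return one parsed entry per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


# Parse the sample line from the expected output above.
print(parse_gpu_line("Tesla T4, 610.43.02, 15360 MiB"))
```

On a multi-GPU instance, query_gpus() would return one dictionary per GPU.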

Step 5 - Retrieve your password

sudo cat /root/ollama-credentials.txt

Expected output:

# Ollama - generated on first boot by ollama-firstboot.service
OLLAMA_URL=http://<instance-public-ip>/
OLLAMA_USERNAME=admin
OLLAMA_PASSWORD=<your-unique-password>
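
The credentials file is a set of KEY=VALUE lines with a leading comment, so it is easy to read from a script. The following sketch parses that layout; parse_credentials is a hypothetical helper, and the sample values are made up for illustration. Because the file is mode 0600 and owned by root, a script that reads it directly must run as root.

```python
def parse_credentials(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    creds = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        creds[key] = value
    return creds


# Illustrative sample in the same layout as /root/ollama-credentials.txt.
sample = """# Ollama - generated on first boot by ollama-firstboot.service
OLLAMA_URL=http://203.0.113.10/
OLLAMA_USERNAME=admin
OLLAMA_PASSWORD=example=password
"""
creds = parse_credentials(sample)
print(creds["OLLAMA_USERNAME"], creds["OLLAMA_URL"])
```

Splitting on the first '=' only keeps passwords that themselves contain '=' intact, which matches the cut -d= -f2- idiom used in the shell examples.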

Step 6 - List the pre-pulled model and pull more

The /api/version endpoint is open; everything else is gated by HTTP Basic Authentication (user admin + your password). List the starter model:

PASS=$(sudo grep '^OLLAMA_PASSWORD=' /root/ollama-credentials.txt | cut -d= -f2-)
curl -s -u "admin:$PASS" http://127.0.0.1/api/tags | jq -c '.models[] | {name, size}'

Expected output:

{"name":"llama3.2:1b","size":1321098329}

Pull a larger model (for example Llama 3.2 3B):

curl -u "admin:$PASS" http://<instance-public-ip>/api/pull -d '{"name":"llama3.2:3b"}'
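
While a pull runs, the endpoint streams progress updates. The sketch below turns those updates into percentages, assuming each streamed line is a JSON object with a status field and, during downloads, total and completed byte counts. The sample lines are illustrative, not captured output, and pull_progress is a hypothetical helper.

```python
import json


def pull_progress(lines):
    """Yield (status, percent or None) for each streamed JSON line."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        event = json.loads(raw)
        total = event.get("total")
        done = event.get("completed", 0)
        # Only download events carry byte counts; others report status alone.
        percent = round(100 * done / total, 1) if total else None
        yield event.get("status", ""), percent


# Illustrative lines in the assumed streaming format.
sample = [
    '{"status":"pulling manifest"}',
    '{"status":"pulling abc123","total":2000,"completed":500}',
    '{"status":"success"}',
]
for status, percent in pull_progress(sample):
    print(status, percent)
```

In a real client, the lines would come from iterating over the HTTP response body of the /api/pull request.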

Step 7 - Generate a completion (REST API)

PASS=$(sudo grep '^OLLAMA_PASSWORD=' /root/ollama-credentials.txt | cut -d= -f2-)
curl -s -u "admin:$PASS" http://127.0.0.1/api/generate \
  -d '{"model":"llama3.2:1b","prompt":"In one sentence, what is a large language model?","stream":false}' | jq -r .response

From your own machine, target the instance's public IP instead of 127.0.0.1 and set PASS to the password you retrieved in Step 5.
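
The same request can be built in Python, and the response can be used to estimate generation speed. This sketch assumes the non-streaming response includes eval_count (generated tokens) and eval_duration (nanoseconds); the helper names and the sample numbers are illustrative.

```python
import json


def generate_payload(model, prompt):
    """Body for POST /api/generate with streaming disabled, as in the curl call."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()


def tokens_per_second(response):
    """Generation speed, assuming eval_duration is reported in nanoseconds."""
    duration_ns = response.get("eval_duration", 0)
    if not duration_ns:
        return None
    return response.get("eval_count", 0) * 1e9 / duration_ns


body = generate_payload("llama3.2:1b", "In one sentence, what is a large language model?")

# Made-up response values, for illustration only.
sample = {"response": "(generated text)", "eval_count": 50, "eval_duration": 2_000_000_000}
print(tokens_per_second(sample))
```

Comparing this figure across instance types is one way to judge whether a larger GPU is worth the cost for your model.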

Step 8 - Use the OpenAI-compatible endpoint

Point any OpenAI SDK at http://<instance-public-ip>/v1 with HTTP Basic Auth:

import base64

from openai import OpenAI

# nginx expects HTTP Basic credentials: admin plus the password from Step 5.
basic = base64.b64encode(b"admin:<your-unique-password>").decode()

client = OpenAI(
    base_url="http://<instance-public-ip>/v1",
    api_key="ollama",  # required by the SDK, ignored by Ollama
    default_headers={"Authorization": f"Basic {basic}"},  # sent instead of the SDK's Bearer header
)
resp = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

This makes the instance a drop-in private LLM backend for LangChain, LlamaIndex or your own application.
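
Applications that stream responses receive the reply in chunks. The sketch below reassembles a streamed chat completion, assuming the /v1 endpoint follows the OpenAI server-sent-events format ("data: {...}" lines ending with "data: [DONE]"); the sample lines are illustrative and collect_stream is a hypothetical helper.

```python
import json


def collect_stream(lines):
    """Join the content deltas from 'data: ...' lines until '[DONE]'."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip the blank separator lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)


# Illustrative event lines in the assumed format.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant","content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo!"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))
```

When you use the OpenAI SDK with stream=True, the SDK performs this parsing for you; the sketch only shows what travels over the wire.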

Step 9 - Confirm GPU offload

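PASS=$(sudo grep '^OLLAMA_PASSWORD=' /root/ollama-credentials.txt | cut -d= -f2-)
curl -s -u "admin:$PASS" http://127.0.0.1/api/ps | jq -c '.models[] | {name, size_vram}'

A non-zero size_vram confirms the model is loaded on the GPU rather than running only on the CPU. /api/ps lists only models currently loaded in memory, so run it shortly after a request such as the one in Step 7; otherwise the output is empty.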

{"name":"llama3.2:1b","size_vram":1514584145}
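
To see how much of a model is on the GPU rather than just whether any of it is, compare size_vram with the model's total loaded size. This sketch assumes each /api/ps entry also carries a size field in bytes; the second sample entry is made up to show a partial offload, and gpu_fraction is a hypothetical helper.

```python
def gpu_fraction(model):
    """Share of a loaded model held in GPU memory (0.0 to 1.0)."""
    size = model.get("size", 0)
    if not size:
        return 0.0
    return min(model.get("size_vram", 0) / size, 1.0)


# The first entry mirrors the output above; the second is a made-up partial offload.
models = [
    {"name": "llama3.2:1b", "size": 1514584145, "size_vram": 1514584145},
    {"name": "example:70b", "size": 40_000_000_000, "size_vram": 10_000_000_000},
]
for m in models:
    print(m["name"], f"{gpu_fraction(m):.0%} on GPU")
```

A value well below 100% suggests that part of the model is running on the CPU and a larger GPU instance may be needed.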

Enabling HTTPS

sudo apt-get update && sudo apt-get install -y certbot python3-certbot-nginx
sudo certbot --nginx -d your-domain.example.com

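certbot edits the nginx site at /etc/nginx/sites-available/cloudimg-ollama to add the TLS listener and arranges automatic renewal. Before running it, point a DNS record for your domain at the instance and allow inbound TCP 443 in the security group, because the default security group opens only ports 22 and 80.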

Backup and maintenance

  • Model weights live under /var/lib/ollama/models on a dedicated EBS volume. Snapshot that volume to back up pulled models, or simply re-pull them with ollama pull on a new instance.
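  • The password hash is stored in the nginx htpasswd file /etc/nginx/.ollama.htpasswd; rotate it with sudo htpasswd /etc/nginx/.ollama.htpasswd admin. Rotating it this way does not update /root/ollama-credentials.txt, so record the new password yourself.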
  • Pull models with the bundled CLI: sudo -u ollama OLLAMA_HOST=127.0.0.1:11434 ollama pull mistral; list with ollama list; remove with ollama rm <model>.
  • Restart with sudo systemctl restart ollama.service; logs: sudo journalctl -u ollama.service.
  • GPU sizing: for full GPU speed a model must fit in GPU memory (the T4 has 16 GB). For 70B-class models use a larger or multi-GPU instance; if a model does not fit, Ollama runs part or all of it on the CPU, which is much slower.
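
As a rough aid for the GPU sizing note above, the following sketch estimates memory from parameter count. It is a back-of-the-envelope heuristic, not Ollama's own calculation: the 4 bits per weight and the 20% overhead factor are assumptions, and real usage also depends on the quantization and context length of the model you pull.

```python
def estimated_gib(params_billion, bits_per_weight=4, overhead=1.2):
    """Rough memory estimate: weight bytes times an assumed overhead factor."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 2**30


def fits(params_billion, gpu_mib=15360, **kwargs):
    """True if the estimate is below the GPU memory reported by nvidia-smi."""
    return estimated_gib(params_billion, **kwargs) < gpu_mib / 1024


# 15360 MiB is the T4 figure from Step 4; the model sizes are examples.
for size in (1, 8, 70):
    print(f"{size}B: ~{estimated_gib(size):.1f} GiB, fits on T4: {fits(size)}")
```

Under these assumptions an 8B model fits comfortably on a T4 while a 70B model does not, which is consistent with the instance recommendations in the prerequisites.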

Support

cloudimg provides 24/7 technical support for this image by email and chat, covering Ollama deployment, model selection, GPU sizing, quantization, the OpenAI-compatible API, TLS termination and scaling. Contact details are on the AWS Marketplace listing.