Ollama on AWS User Guide
Overview
This image runs Ollama 0.30 on Ubuntu 24.04 LTS with NVIDIA GPU acceleration. Ollama pulls and serves open large language models such as Llama, Mistral, Gemma, Phi, Qwen and DeepSeek behind a REST API that is also compatible with the OpenAI chat-completions API. The NVIDIA datacenter driver is preinstalled and verified on real hardware during the build, and Ollama auto-detects the GPU and offloads model inference to it.
The server listens on the loopback address 127.0.0.1:11434 under an unprivileged ollama system account; nginx fronts it on port 80 with HTTP Basic Authentication. The public /api/version endpoint stays open for health checks; everything else (model pull, generate, chat and the /v1/* OpenAI-compatible endpoints) requires the password. The default security group opens port 22 (SSH) and port 80 (HTTP) only, so 11434 is not reachable externally.
On the first boot of every deployed instance a one-shot service generates a fresh password, unique to that instance, and writes it to /root/ollama-credentials.txt (mode 0600, root only). Model weights live under /var/lib/ollama/models on a dedicated, independently resizable EBS data volume. A small starter model (llama3.2:1b) is pre-pulled so the API responds immediately; you pull additional models with ollama pull.
Prerequisites
- An AWS account subscribed to this product in AWS Marketplace.
- An EC2 key pair in your target region for SSH access.
- A security group allowing inbound TCP 22 (SSH) from your IP and TCP 80 (HTTP) from your users.
- An NVIDIA GPU instance type. Recommended: g4dn.xlarge (NVIDIA T4, good value for 1B-8B models). For larger models or higher throughput use g5/g6 (A10G / L4) or multi-GPU g5.12xlarge.
Connecting to your instance
| OS variant | Login user | Example |
|---|---|---|
| Ubuntu 24.04 | ubuntu | ssh -i your-key.pem ubuntu@<instance-public-ip> |
Step 1 - Launch from the AWS Marketplace console
- Open the product page in AWS Marketplace and choose Continue to Subscribe, then Continue to Configuration.
- Select the Ollama 0.30 on Ubuntu 24.04 delivery option and your region, then Continue to Launch.
- Choose a GPU instance type (g4dn.xlarge or larger), your VPC/subnet, key pair and the security group described above, and launch.
Step 2 - Launch from the AWS CLI
aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type g4dn.xlarge \
  --key-name your-key \
  --security-group-ids sg-xxxxxxxx \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ollama}]'
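The same launch can be scripted from Python. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the helper names build_run_instances_params and launch are hypothetical, and the AMI, key pair and security group values are the same placeholders used in the CLI command.

```python
def build_run_instances_params(image_id, key_name, security_group_id,
                               instance_type="g4dn.xlarge", name="ollama"):
    """Return keyword arguments mirroring the CLI flags above (hypothetical helper)."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "SecurityGroupIds": [security_group_id],
        "MinCount": 1,  # boto3 requires explicit counts; the CLI defaults to 1
        "MaxCount": 1,
        "TagSpecifications": [
            {"ResourceType": "instance",
             "Tags": [{"Key": "Name", "Value": name}]},
        ],
    }


def launch(params, region_name=None):
    """Launch the instance and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above has no dependencies

    ec2 = boto3.client("ec2", region_name=region_name)
    response = ec2.run_instances(**params)
    return response["Instances"][0]["InstanceId"]


# Placeholder values, exactly as in the CLI example; replace before launching.
params = build_run_instances_params(
    "ami-xxxxxxxxxxxxxxxxx", "your-key", "sg-xxxxxxxx")
```

Calling launch(params) starts a billable instance, so the sketch only builds the parameters until you invoke it explicitly.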
Step 3 - Connect to your instance
ssh -i your-key.pem ubuntu@<instance-public-ip>
Step 4 - Confirm the GPU and services
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
systemctl is-active ollama.service nginx.service
curl -s http://127.0.0.1/api/version
Expected output:
Tesla T4, 610.43.02, 15360 MiB
active
active
{"version":"0.30.7"}
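If you script this check, the nvidia-smi line can be parsed into fields. This is a rough sketch that assumes the exact query shown above (name, driver_version, memory.total in CSV form without a header); parse_gpu_line is a hypothetical helper name.

```python
def parse_gpu_line(line):
    """Split one 'name, driver_version, memory.total' CSV line into a dict."""
    name, driver, memory = [field.strip() for field in line.split(",")]
    value, unit = memory.split()
    if unit != "MiB":
        raise ValueError(f"unexpected memory unit: {unit!r}")
    return {"name": name, "driver_version": driver, "memory_mib": int(value)}


# The sample line is the expected output shown above.
gpu = parse_gpu_line("Tesla T4, 610.43.02, 15360 MiB")
print(gpu["name"], gpu["memory_mib"])  # prints: Tesla T4 15360
```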

Step 5 - Retrieve your password
sudo cat /root/ollama-credentials.txt
Expected output:
# Ollama - generated on first boot by ollama-firstboot.service
OLLAMA_URL=http://<instance-public-ip>/
OLLAMA_USERNAME=admin
OLLAMA_PASSWORD=<your-unique-password>
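For scripts that need these values, the file can be read as simple KEY=VALUE lines. The sketch below assumes the layout shown above (one '#' comment line followed by three assignments); parse_credentials is a hypothetical helper, and reading the real file requires root.

```python
def parse_credentials(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    creds = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            creds[key] = value  # partition keeps any '=' inside the value
    return creds


# Placeholder contents copied from the example above, not real credentials.
sample = """# Ollama - generated on first boot by ollama-firstboot.service
OLLAMA_URL=http://<instance-public-ip>/
OLLAMA_USERNAME=admin
OLLAMA_PASSWORD=<your-unique-password>
"""
creds = parse_credentials(sample)
print(creds["OLLAMA_USERNAME"])  # prints: admin
```

On the instance, pass it the contents of /root/ollama-credentials.txt from a process running as root.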
Step 6 - List the pre-pulled model and pull more
The /api/version endpoint is open; everything else is gated by HTTP Basic Authentication (user admin + your password). List the starter model:
PASS=$(sudo grep '^OLLAMA_PASSWORD=' /root/ollama-credentials.txt | cut -d= -f2-)
curl -s -u "admin:$PASS" http://127.0.0.1/api/tags | jq -c '.models[] | {name, size}'
Expected output:
{"name":"llama3.2:1b","size":1321098329}
Pull a larger model (for example Llama 3.2 3B):
curl -u "admin:$PASS" http://<instance-public-ip>/api/pull -d '{"name":"llama3.2:3b"}'
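The pull request streams progress while the download runs. The sketch below turns that stream into percentages, assuming the endpoint emits one JSON object per line with a status field and, during downloads, total and completed byte counts, as the Ollama API documentation describes. pull_progress is a hypothetical helper, and the sample lines are illustrative rather than captured output.

```python
import json


def pull_progress(ndjson_lines):
    """Yield (status, percent or None) for each streamed status object."""
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        total = event.get("total")
        completed = event.get("completed", 0)
        # Only download events carry byte counts; others report status alone.
        percent = round(100 * completed / total, 1) if total else None
        yield event.get("status", ""), percent


# Illustrative stream, not real output from this image.
sample_stream = [
    '{"status":"pulling manifest"}',
    '{"status":"pulling layer","total":2000,"completed":500}',
    '{"status":"success"}',
]
for status, percent in pull_progress(sample_stream):
    print(status, percent)
```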
Step 7 - Generate a completion (REST API)
PASS=$(sudo grep '^OLLAMA_PASSWORD=' /root/ollama-credentials.txt | cut -d= -f2-)
curl -s -u "admin:$PASS" http://127.0.0.1/api/generate \
  -d '{"model":"llama3.2:1b","prompt":"In one sentence, what is a large language model?","stream":false}' | jq -r .response
From your own machine, target the instance's public IP instead of 127.0.0.1.
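The same call can be made from Python with only the standard library. This is a minimal sketch, assuming the admin username and the /api/generate request body shown above; build_generate_request and generate are hypothetical helper names, and the password and URL are placeholders.

```python
import base64
import json
import urllib.request


def build_generate_request(base_url, password, prompt,
                           model="llama3.2:1b", username="admin"):
    """Build (but do not send) an authenticated POST to /api/generate."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=body.encode(),
        headers={"Authorization": "Basic " + token,
                 "Content-Type": "application/json"},
        method="POST",
    )


def generate(request, timeout=120):
    """Send the request and return the 'response' field of the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.load(reply)["response"]


# Placeholder password; substitute the value from the credentials file.
request = build_generate_request(
    "http://127.0.0.1/", "secret",
    "In one sentence, what is a large language model?")
```

Calling generate(request) sends the prompt; keeping the builder separate makes it possible to inspect the URL and headers before anything leaves the machine.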
Step 8 - Use the OpenAI-compatible endpoint
Point any OpenAI SDK at http://<instance-public-ip>/v1 with HTTP Basic Auth:
import base64

from openai import OpenAI

# nginx expects HTTP Basic credentials; encode "admin:<password>" once.
basic_token = base64.b64encode(b"admin:<your-unique-password>").decode()

client = OpenAI(
    base_url="http://<instance-public-ip>/v1",
    api_key="ollama",  # required by the SDK, ignored by Ollama
    default_headers={"Authorization": "Basic " + basic_token},
)
resp = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
This makes the instance a drop-in private LLM backend for LangChain, LlamaIndex or your own application.
Step 9 - Confirm GPU offload
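PASS=$(sudo grep '^OLLAMA_PASSWORD=' /root/ollama-credentials.txt | cut -d= -f2-)
curl -s -u "admin:$PASS" http://127.0.0.1/api/ps | jq -c '.models[] | {name, size_vram}'
/api/ps lists only models that are currently loaded in memory, so run this shortly after a generate or chat request. A non-zero size_vram confirms that the model is loaded on the GPU rather than running entirely on the CPU: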
{"name":"llama3.2:1b","size_vram":1514584145}
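To see how much of each model is on the GPU, size_vram can be compared with the model's total size. The sketch below assumes the /api/ps response also carries a size field next to size_vram, as the Ollama API documentation describes; gpu_offload_report is a hypothetical helper, and the size value in the sample is assumed for illustration.

```python
def gpu_offload_report(ps_payload):
    """Map each loaded model name to the fraction of its size held in VRAM."""
    report = {}
    for model in ps_payload.get("models", []):
        size = model.get("size", 0)
        vram = model.get("size_vram", 0)
        report[model["name"]] = vram / size if size else 0.0
    return report


# size_vram is taken from the output above; size is an assumed value.
sample_ps = {"models": [
    {"name": "llama3.2:1b", "size": 1514584145, "size_vram": 1514584145},
]}
print(gpu_offload_report(sample_ps))  # prints: {'llama3.2:1b': 1.0}
```

A fraction of 1.0 means the model is held entirely in GPU memory; a lower value suggests part of it is running on the CPU.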
Enabling HTTPS
sudo apt-get update && sudo apt-get install -y certbot python3-certbot-nginx
sudo certbot --nginx -d your-domain.example.com
certbot edits the nginx site at /etc/nginx/sites-available/cloudimg-ollama to add the TLS listener and arranges automatic renewal.
Backup and maintenance
- Model weights live under /var/lib/ollama/models on a dedicated EBS volume. Snapshot that volume to back up pulled models, or simply re-pull them with ollama pull on a new instance.
- The password hash is stored in the nginx htpasswd file /etc/nginx/.ollama.htpasswd; rotate the password with sudo htpasswd /etc/nginx/.ollama.htpasswd admin.
- Pull models with the bundled CLI: sudo -u ollama OLLAMA_HOST=127.0.0.1:11434 ollama pull mistral; list with ollama list; remove with ollama rm <model>.
- Restart with sudo systemctl restart ollama.service; logs: sudo journalctl -u ollama.service.
- GPU sizing: to run fully on the GPU, a single model must fit in GPU memory (the T4 has 16 GB). For 70B-class models use a larger or multi-GPU instance; Ollama falls back to CPU if a model does not fit, which is much slower.
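The GPU sizing rule can be approximated with simple arithmetic. This is a rough sketch, not a sizing guarantee: it assumes memory is dominated by the weights (parameter count times bits per weight) plus a 20% allowance for cache and runtime overhead. The 4-bit default and the overhead factor are assumptions, and both function names are hypothetical.

```python
GIB = 2**30


def estimate_model_gib(params_billion, bits_per_weight=4, overhead=1.2):
    """Estimate GPU memory in GiB: weight bytes scaled by an assumed overhead."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / GIB


def fits_on_gpu(params_billion, gpu_gib, bits_per_weight=4, overhead=1.2):
    """Return True if the estimate fits within the given GPU memory."""
    return estimate_model_gib(params_billion, bits_per_weight, overhead) <= gpu_gib


T4_GIB = 15360 / 1024  # the 15360 MiB that nvidia-smi reports for the T4

for size in (1, 8, 70):
    print(f"{size}B at 4-bit fits on a T4: {fits_on_gpu(size, T4_GIB)}")
```

Under these assumptions an 8B model at 4-bit needs about 4.5 GiB and fits on a T4, while a 70B model needs about 39 GiB and does not.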
Support
cloudimg provides 24/7 technical support for this image by email and chat, covering Ollama deployment, model selection, GPU sizing, quantization, the OpenAI-compatible API, TLS termination and scaling. Contact details are on the AWS Marketplace listing.