vLLM on AWS User Guide
Overview
This image runs vLLM 0.22, a high-throughput, memory-efficient inference and serving engine for large language models built on PagedAttention, on Ubuntu 24.04 LTS with NVIDIA GPU acceleration. vLLM exposes an OpenAI-compatible REST API, so existing OpenAI SDK code works once it is pointed at the instance's base URL and API key. The NVIDIA datacenter driver and CUDA toolkit are preinstalled and verified on real hardware during the build.
The server listens on the loopback address 127.0.0.1:8000 under an unprivileged vllm system account; nginx fronts it on port 80. vLLM's native API-key authentication is enabled: the /v1/* endpoints require an Authorization: Bearer token; /health stays open. The default security group opens port 22 (SSH) and port 80 (HTTP) only, so 8000 is not reachable externally.
On the first boot of every deployed instance a one-shot service generates a fresh API key, unique to that instance, and writes it to /root/vllm-credentials.txt (mode 0600, root only). Model weights live under HF_HOME=/var/lib/vllm/hf on a dedicated, independently resizable EBS data volume. A small open-weights model (Qwen/Qwen2.5-1.5B-Instruct, Apache-2.0) is pre-downloaded and served by default; you serve a different model by editing MODEL in /etc/vllm/vllm.env.
GPU requirement: vLLM's default flashinfer attention backend requires an NVIDIA Ampere or newer GPU. Launch on a g5 (A10G) or g6 (L4) instance, not the older g4dn (T4 / Turing).
Prerequisites
- An AWS account subscribed to this product in AWS Marketplace.
- An EC2 key pair in your target region for SSH access.
- A security group allowing inbound TCP 22 (SSH) from your IP and TCP 80 (HTTP) from your users.
- An NVIDIA Ampere+ GPU instance. Recommended: g5.xlarge (A10G, 24 GB). For larger models or higher throughput, use g5.2xlarge+ or g6 (L4).
Connecting to your instance
| OS variant | Login user | Example |
|---|---|---|
| Ubuntu 24.04 | ubuntu | ssh -i your-key.pem ubuntu@<instance-public-ip> |
Step 1 - Launch from the AWS Marketplace console
- Open the product page in AWS Marketplace and choose Continue to Subscribe, then Continue to Configuration.
- Select the vLLM 0.22 on Ubuntu 24.04 delivery option and your region, then Continue to Launch.
- Choose an Ampere+ GPU instance type (g5.xlarge or larger), your VPC/subnet, key pair and the security group described above, and launch.
Step 2 - Launch from the AWS CLI
aws ec2 run-instances \
--image-id ami-xxxxxxxxxxxxxxxxx \
--instance-type g5.xlarge \
--key-name your-key \
--security-group-ids sg-xxxxxxxx \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=vllm}]'
Step 3 - Connect to your instance
ssh -i your-key.pem ubuntu@<instance-public-ip>
Step 4 - Confirm the GPU and services
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
systemctl is-active vllm.service nginx.service
curl -s http://127.0.0.1/health -w 'HTTP %{http_code}\n'
Expected output:
NVIDIA A10G, 23028 MiB
active
active
HTTP 200
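Right after boot or a restart, /health may not answer 200 until the model has finished loading. The sketch below polls the endpoint from Python instead of re-running curl by hand. The function names, the 60 attempts and the 5-second delay are illustrative assumptions, not settings shipped with the image.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET on url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, ...


def wait_until_healthy(url, attempts=60, delay=5.0,
                       probe=http_status, sleep=time.sleep):
    """Poll url until it returns 200; give up after `attempts` tries.

    `probe` and `sleep` are injectable so the loop can be tested offline.
    """
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

On the instance, `wait_until_healthy("http://127.0.0.1/health")` returns True once nginx and vLLM both respond, or False when the attempts run out.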

Step 5 - Retrieve your API key
sudo cat /root/vllm-credentials.txt
Example contents:
# vLLM - generated on first boot by vllm-firstboot.service
VLLM_URL=http://<instance-public-ip>/
VLLM_MODEL=Qwen/Qwen2.5-1.5B-Instruct
VLLM_API_KEY=sk-<your-unique-key>
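The credentials file is a plain KEY=VALUE list, so a client script can read it without shell tools. The parser below is a minimal sketch that assumes the layout shown above (one pair per line, `#` for comments); the function name and the sample values are hypothetical.

```python
def parse_credentials(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return values


# Placeholder contents; the real file holds your instance's own values.
sample = """# vLLM - generated on first boot by vllm-firstboot.service
VLLM_URL=http://203.0.113.10/
VLLM_MODEL=Qwen/Qwen2.5-1.5B-Instruct
VLLM_API_KEY=sk-example
"""
creds = parse_credentials(sample)
```

Because the file is mode 0600 and owned by root, a script reading it directly on the instance has to run as root.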
Step 6 - List the served model
The /health endpoint is open; the OpenAI-compatible endpoints require the key.
KEY=$(sudo grep '^VLLM_API_KEY=' /root/vllm-credentials.txt | cut -d= -f2-)
curl -s -H "Authorization: Bearer $KEY" http://127.0.0.1/v1/models | jq -c '.data[] | {id, max_model_len}'
Expected output:
{"id":"Qwen/Qwen2.5-1.5B-Instruct","max_model_len":8192}
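The same query can be made from Python with only the standard library. This is a sketch under the assumption that the response has the shape shown above (a `data` array whose entries carry an `id`); the helper names are hypothetical.

```python
import json
import urllib.request


def models_request(base_url, api_key):
    """Build a GET request for /v1/models carrying the Bearer token."""
    url = base_url.rstrip("/") + "/v1/models"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )


def model_ids(response_body):
    """Extract the model ids from a /v1/models JSON response body."""
    return [entry["id"] for entry in json.loads(response_body)["data"]]


def list_models(base_url, api_key, timeout=10.0):
    """Send the request and return the ids of the served models."""
    request = models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return model_ids(resp.read())
```

Calling `list_models("http://127.0.0.1", key)` on the instance should return a list containing the model named in /etc/vllm/vllm.env.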
Step 7 - Chat completion (OpenAI-compatible)
KEY=$(sudo grep '^VLLM_API_KEY=' /root/vllm-credentials.txt | cut -d= -f2-)
curl -s -H "Authorization: Bearer $KEY" -H 'Content-Type: application/json' \
http://127.0.0.1/v1/chat/completions \
-d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","messages":[{"role":"user","content":"In one sentence, what is vLLM?"}],"max_tokens":64}' \
| jq -r '.choices[0].message.content'
From your own machine, target the instance's public IP instead of 127.0.0.1.
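For clients that cannot install the OpenAI SDK, the curl call above maps directly onto the Python standard library. The sketch below builds the same JSON body and headers; the function names and the 60-second timeout are assumptions chosen for illustration.

```python
import json
import urllib.request


def chat_request(base_url, api_key, model, prompt, max_tokens=64):
    """Build a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def first_reply(response_body):
    """Return the text of the first choice in a chat completion response."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def chat(base_url, api_key, model, prompt, max_tokens=64, timeout=60.0):
    """Send one user message and return the assistant's reply."""
    request = chat_request(base_url, api_key, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return first_reply(resp.read())
```

From a remote machine, pass `http://<instance-public-ip>` as `base_url`; until HTTPS is enabled the key travels in clear text, so prefer a trusted network.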
Step 8 - Use the OpenAI SDK
from openai import OpenAI
client = OpenAI(base_url="http://<instance-public-ip>/v1", api_key="<your-unique-key>")
resp = client.chat.completions.create(
model="Qwen/Qwen2.5-1.5B-Instruct",
messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
This makes the instance a drop-in private LLM backend for LangChain, LlamaIndex or your own application.
Step 9 - Confirm GPU use
nvidia-smi --query-compute-apps=process_name,used_memory --format=csv,noheader
A vLLM process holding GPU memory confirms the model is served on the GPU:
/opt/vllm/venv/bin/python3, 19534 MiB
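For monitoring scripts, the CSV output can be parsed rather than read by eye. The sketch below assumes the two-column `process_name, used_memory` format shown above and treats any process whose path contains /opt/vllm/ as vLLM; that marker and the function names are assumptions based on the example output.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'process_name, used_memory' rows into (name, MiB) tuples."""
    apps = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a comma in the path cannot break parsing.
        name, _, memory = line.rpartition(",")
        fields = memory.split()
        # Rows without a numeric memory value (e.g. "[N/A]") are skipped.
        if fields and fields[0].isdigit():
            apps.append((name.strip(), int(fields[0])))
    return apps


def vllm_gpu_mib(csv_text, marker="/opt/vllm/"):
    """Sum the GPU memory (MiB) of processes whose path contains marker."""
    return sum(mib for name, mib in parse_compute_apps(csv_text) if marker in name)


def query_compute_apps():
    """Run nvidia-smi with the query used above and return its stdout."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=process_name,used_memory",
         "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

A result of 0 from `vllm_gpu_mib(query_compute_apps())` suggests the model is not loaded on the GPU.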
Serving a different model
vLLM serves one model, chosen at startup. Edit the model name and restart:
sudo sed -i 's#^MODEL=.*#MODEL=Qwen/Qwen2.5-7B-Instruct#' /etc/vllm/vllm.env
sudo systemctl restart vllm.service
The model is downloaded into /var/lib/vllm/hf on first use. Larger models need more GPU memory: a 7B model in fp16 needs roughly 16 GB, so use g5.xlarge (24 GB) or larger.
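The sizing rule can be worked through as arithmetic: fp16 stores 2 bytes per parameter, so 7 billion parameters take about 14 GB for the weights alone, and the rest of the roughly 16 GB figure is room for runtime overhead. The sketch below encodes that estimate; the 2 GB headroom is an assumption picked to match the figure above, not a measured value, and real usage also depends on context length and batch size.

```python
def weight_memory_gb(params_billion, bytes_per_param=2.0):
    """Approximate memory for the weights alone, in GB (10^9 bytes).

    One billion parameters at one byte each is one GB, so the product
    of the two arguments is already in GB. fp16 and bf16 use 2 bytes.
    """
    return params_billion * bytes_per_param


def fits_on_gpu(params_billion, gpu_gb, bytes_per_param=2.0, headroom_gb=2.0):
    """True if the weights plus an assumed headroom fit in gpu_gb."""
    needed = weight_memory_gb(params_billion, bytes_per_param) + headroom_gb
    return needed <= gpu_gb


seven_b_weights = weight_memory_gb(7)    # 14.0 GB of weights in fp16
seven_b_on_a10g = fits_on_gpu(7, 24)     # True: 14 + 2 <= 24
```

The same functions show why a larger model needs a bigger instance: `fits_on_gpu(14, 24)` is False, since 28 GB of fp16 weights already exceed the A10G's 24 GB.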
Enabling HTTPS
sudo apt-get update && sudo apt-get install -y certbot python3-certbot-nginx
sudo certbot --nginx -d your-domain.example.com
certbot edits the nginx site at /etc/nginx/sites-available/cloudimg-vllm to add the TLS listener and arranges automatic renewal.
Backup and maintenance
- Model weights live under /var/lib/vllm/hf on a dedicated EBS volume. Snapshot that volume to back up downloaded models, or re-download with the model name.
- The API key is in /etc/vllm/vllm.env (VLLM_API_KEY); rotate it by editing that file and restarting vllm.service.
- Restart with sudo systemctl restart vllm.service; logs: sudo journalctl -u vllm.service.
- Tune throughput/memory with vLLM flags in the service ExecStart (--gpu-memory-utilization, --max-model-len, --max-num-seqs).
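Key rotation can be scripted. The sketch below generates a random key and rewrites the VLLM_API_KEY line in the text of an env file. The `sk-` prefix only mirrors the format shown in the credentials file, the key length is an assumption, and both helper names are hypothetical; writing the result back to /etc/vllm/vllm.env and restarting vllm.service remain manual steps that need root.

```python
import secrets


def new_api_key(prefix="sk-", nbytes=32):
    """Return a random URL-safe key; prefix and length are assumptions."""
    return prefix + secrets.token_urlsafe(nbytes)


def set_env_value(env_text, name, value):
    """Return env_text with the NAME=... line replaced, or appended if absent."""
    lines = env_text.splitlines()
    replaced = False
    for index, line in enumerate(lines):
        if line.startswith(name + "="):
            lines[index] = f"{name}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{name}={value}")
    return "\n".join(lines) + "\n"


# Illustrative contents only; the real file may hold other settings too.
old_env = "MODEL=Qwen/Qwen2.5-1.5B-Instruct\nVLLM_API_KEY=sk-old\n"
new_env = set_env_value(old_env, "VLLM_API_KEY", new_api_key())
```

Every other line passes through unchanged, so the same helper also works for switching MODEL.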
Support
cloudimg provides 24/7 technical support for this image by email and chat, covering vLLM deployment, model selection, GPU sizing, throughput tuning, quantization, the OpenAI-compatible API, TLS termination and scaling. Contact details are on the AWS Marketplace listing.