vLLM on AWS User Guide

Last updated: 2026-06-09 | Product: vLLM on AWS

Overview

This image runs vLLM 0.22, a high-throughput, memory-efficient inference and serving engine for large language models built around PagedAttention, on Ubuntu 24.04 LTS with NVIDIA GPU acceleration. vLLM exposes an OpenAI-compatible REST API, so existing OpenAI SDK code works once it is pointed at the instance's base URL and API key. The NVIDIA datacenter driver and CUDA toolkit are preinstalled and verified on real hardware during the build.

The vLLM server listens only on the loopback address 127.0.0.1:8000 and runs under an unprivileged vllm system account; nginx fronts it on port 80. vLLM's native API-key authentication is enabled: the /v1/* endpoints require an Authorization: Bearer token, while /health stays open. The default security group opens only port 22 (SSH) and port 80 (HTTP), so port 8000 is not reachable from outside the instance.
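
Seen from a client, the authentication rule can be sketched as follows. This is an illustration only, using Python's standard library: `build_request` and `BASE_URL` are hypothetical names, the key is a placeholder, and the snippet constructs requests without sending them.

```python
import urllib.request

BASE_URL = "http://127.0.0.1"  # nginx on port 80; use the public IP from outside


def build_request(path, api_key=None):
    """Build a GET request, attaching the bearer token when one is given."""
    req = urllib.request.Request(BASE_URL + path, method="GET")
    if api_key is not None:
        req.add_header("Authorization", f"Bearer {api_key}")
    return req


# /health is open, so no token is attached.
health = build_request("/health")

# /v1/* endpoints expect "Authorization: Bearer <key>".
models = build_request("/v1/models", api_key="sk-<your-unique-key>")

print(health.full_url, health.get_header("Authorization"))
print(models.full_url, models.get_header("Authorization"))
```

Passing either object to `urllib.request.urlopen` sends it. A /v1/* request without the header should be rejected by the server.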

On the first boot of every deployed instance, a one-shot service generates a fresh API key, unique to that instance, and writes it to /root/vllm-credentials.txt (mode 0600, root only). Model weights live under HF_HOME=/var/lib/vllm/hf on a dedicated, independently resizable EBS data volume. A small open-weights model (Qwen/Qwen2.5-1.5B-Instruct, Apache-2.0) is pre-downloaded and served by default; to serve a different model, edit MODEL in /etc/vllm/vllm.env and restart vllm.service.

GPU requirement: vLLM's default flashinfer attention backend requires an NVIDIA Ampere or newer GPU. Launch on a g5 (A10G) or g6 (L4) instance, not the older g4dn (T4 / Turing).

Prerequisites

An AWS account subscribed to this product in AWS Marketplace.
An EC2 key pair in your target region for SSH access.
A security group allowing inbound TCP 22 (SSH) from your IP and TCP 80 (HTTP) from your users.
An NVIDIA GPU instance of the Ampere generation or newer. Recommended: g5.xlarge (A10G, 24 GB). For larger models or higher throughput, use g5.2xlarge or larger, or g6 (L4).

Connecting to your instance

OS variant	Login user	Example
Ubuntu 24.04	`ubuntu`	`ssh -i your-key.pem ubuntu@<instance-public-ip>`

Step 1 - Launch from the AWS Marketplace console

Open the product page in AWS Marketplace and choose Continue to Subscribe, then Continue to Configuration.
Select the vLLM 0.22 on Ubuntu 24.04 delivery option and your region, then Continue to Launch.
Choose an Ampere+ GPU instance type (g5.xlarge or larger), your VPC/subnet, key pair and the security group described above, and launch.

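Step 2 - Launch from the AWS CLI

As an alternative to the console launch in Step 1, you can launch from the command line. Replace the AMI ID, key name and security group ID with your own values.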

aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type g5.xlarge \
  --key-name your-key \
  --security-group-ids sg-xxxxxxxx \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=vllm}]'

Step 3 - Connect to your instance

ssh -i your-key.pem ubuntu@<instance-public-ip>

Step 4 - Confirm the GPU and services

nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
systemctl is-active vllm.service nginx.service
curl -s http://127.0.0.1/health -w 'HTTP %{http_code}\n'

Expected output:

NVIDIA A10G, 23028 MiB
active
active
HTTP 200

Figure: vLLM running on the cloudimg GPU AMI, with the A10G detected, the OpenAI-compatible API served on the GPU, /health open and /v1 secured by an API key.
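
Right after boot or a service restart, /health may not answer until the model has finished loading, so a script may want to poll rather than check once. The following is a minimal sketch using only the standard library. `probe_health` and `wait_until_healthy` are hypothetical helper names, and the attempt count and delay are arbitrary defaults rather than values taken from this image.

```python
import time
import urllib.error
import urllib.request


def probe_health(url="http://127.0.0.1/health", timeout=5):
    """Return True if GET <url> answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status from nginx.
        return False


def wait_until_healthy(probe=probe_health, attempts=60, delay=5.0, sleep=time.sleep):
    """Call probe() until it succeeds; return the number of tries, or None."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

On the instance, `wait_until_healthy()` returns the attempt number on which the server first answered, or `None` if it never did within the allotted attempts.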

Step 5 - Retrieve your API key

sudo cat /root/vllm-credentials.txt

Example contents:

# vLLM - generated on first boot by vllm-firstboot.service
VLLM_URL=http://<instance-public-ip>/
VLLM_MODEL=Qwen/Qwen2.5-1.5B-Instruct
VLLM_API_KEY=sk-<your-unique-key>
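
If a script needs these values, the file can be read as plain KEY=VALUE lines. The sketch below assumes only the layout shown above (one comment line starting with `#`, then one assignment per line); `parse_credentials` is a hypothetical helper, not something shipped with the image.

```python
def parse_credentials(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    creds = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' sign
            creds[key.strip()] = value.strip()
    return creds


sample = """\
# vLLM - generated on first boot by vllm-firstboot.service
VLLM_URL=http://<instance-public-ip>/
VLLM_MODEL=Qwen/Qwen2.5-1.5B-Instruct
VLLM_API_KEY=sk-<your-unique-key>
"""

creds = parse_credentials(sample)
print(creds["VLLM_MODEL"])  # Qwen/Qwen2.5-1.5B-Instruct
```

Because the real file is readable by root only, a script that opens /root/vllm-credentials.txt directly has to run as root.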

Step 6 - List the served model

The /health endpoint is open; the OpenAI-compatible endpoints require the key.

KEY=$(sudo grep '^VLLM_API_KEY=' /root/vllm-credentials.txt | cut -d= -f2-)
curl -s -H "Authorization: Bearer $KEY" http://127.0.0.1/v1/models | jq -c '.data[] | {id, max_model_len}'

Expected output:

{"id":"Qwen/Qwen2.5-1.5B-Instruct","max_model_len":8192}

Step 7 - Chat completion (OpenAI-compatible)

KEY=$(sudo grep '^VLLM_API_KEY=' /root/vllm-credentials.txt | cut -d= -f2-)
curl -s -H "Authorization: Bearer $KEY" -H 'Content-Type: application/json' \
  http://127.0.0.1/v1/chat/completions \
  -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","messages":[{"role":"user","content":"In one sentence, what is vLLM?"}],"max_tokens":64}' \
  | jq -r '.choices[0].message.content'

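To call the API from your own machine, replace 127.0.0.1 with the instance's public IP and supply the key retrieved in Step 5 directly; the sudo grep line only works on the instance itself.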
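
For clients without curl and jq, the same call can be made with Python's standard library. This is a sketch under stated assumptions: `build_chat_request`, `extract_reply` and `chat` are hypothetical helper names, and the host and key are whatever you pass in.

```python
import json
import urllib.request


def build_chat_request(host, api_key, prompt,
                       model="Qwen/Qwen2.5-1.5B-Instruct", max_tokens=64):
    """Build a POST request for /v1/chat/completions (not sent yet)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"http://{host}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_body):
    """Return choices[0].message.content, like the jq filter above."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def chat(host, api_key, prompt, timeout=60):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(host, api_key, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_reply(resp.read())
```

A call such as `chat("<instance-public-ip>", "<your-unique-key>", "In one sentence, what is vLLM?")` mirrors the curl command. Over plain HTTP the key travels unencrypted, so this is best combined with the HTTPS setup described later in this guide.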

Step 8 - Use the OpenAI SDK

from openai import OpenAI
client = OpenAI(base_url="http://<instance-public-ip>/v1", api_key="<your-unique-key>")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

This makes the instance a drop-in private LLM backend for LangChain, LlamaIndex or your own application.

Step 9 - Confirm GPU use

nvidia-smi --query-compute-apps=process_name,used_memory --format=csv,noheader

A vLLM process holding GPU memory confirms the model is served on the GPU:

/opt/vllm/venv/bin/python3, 19534 MiB
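
To automate this check, the CSV output can be parsed in a few lines. The sketch below assumes the two-column `process_name, used_memory` format shown above with memory reported in MiB; `parse_compute_apps` and `vllm_gpu_memory_mib` are hypothetical helpers, and the `/opt/vllm/` marker is taken from the sample output rather than from any documented guarantee.

```python
def parse_compute_apps(output):
    """Parse 'process_name, used_memory' CSV lines into (name, MiB) tuples."""
    apps = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so commas in a process path are preserved.
        name, _, memory = line.rpartition(",")
        # "19534 MiB" -> 19534; assumes a numeric value is reported.
        apps.append((name.strip(), int(memory.split()[0])))
    return apps


def vllm_gpu_memory_mib(output, marker="/opt/vllm/"):
    """Sum the GPU memory held by processes whose path contains marker."""
    return sum(mib for name, mib in parse_compute_apps(output) if marker in name)


sample = "/opt/vllm/venv/bin/python3, 19534 MiB\n"
print(vllm_gpu_memory_mib(sample))  # 19534
```

A result of 0 would suggest that no vLLM process currently holds GPU memory.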

Serving a different model

vLLM serves one model, chosen at startup. Edit the model name and restart:

sudo sed -i 's#^MODEL=.*#MODEL=Qwen/Qwen2.5-7B-Instruct#' /etc/vllm/vllm.env
sudo systemctl restart vllm.service

The model is downloaded into /var/lib/vllm/hf on first use. Larger models need more GPU memory: a 7B model in fp16 needs roughly 16 GB (the weights alone take about 2 bytes per parameter), so use g5.xlarge (24 GB) or larger.
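
The sizing rule can be worked through as rough arithmetic. The sketch below is an estimate, not a measurement: it uses a nominal 7 billion parameters, the 23028 MiB reported for the A10G earlier in this guide, and an assumed `--gpu-memory-utilization` of 0.90 (the value configured on this image is not stated here). `weight_bytes` and `remaining_mib` are hypothetical helper names.

```python
def weight_bytes(num_params, bytes_per_param=2):
    """Approximate size of the model weights (fp16/bf16 = 2 bytes each)."""
    return num_params * bytes_per_param


def remaining_mib(total_mib, utilization, num_params, bytes_per_param=2):
    """MiB left for KV cache and activations inside vLLM's memory budget."""
    budget_mib = total_mib * utilization
    weights_mib = weight_bytes(num_params, bytes_per_param) / 2**20
    return budget_mib - weights_mib


params = 7_000_000_000  # nominal "7B"; real checkpoints differ somewhat

print(weight_bytes(params) / 1e9)                  # weights in decimal GB
print(round(remaining_mib(23028, 0.90, params)))   # MiB left after weights
```

A small or negative remainder means the model will not fit at that precision on the chosen GPU, and a larger instance or a quantized checkpoint is needed.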

Enabling HTTPS

sudo apt-get update && sudo apt-get install -y certbot python3-certbot-nginx
sudo certbot --nginx -d your-domain.example.com

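certbot edits the nginx site at /etc/nginx/sites-available/cloudimg-vllm to add the TLS listener and arranges automatic renewal. For this to work, the domain must already resolve to the instance's public IP, and the security group must allow inbound TCP 443 in addition to the default ports 22 and 80.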

Backup and maintenance

Model weights live under /var/lib/vllm/hf on a dedicated EBS volume. Snapshot that volume to back up downloaded models, or re-download with the model name.
The API key is in /etc/vllm/vllm.env (VLLM_API_KEY); rotate it by editing that file and restarting vllm.service.
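Restart the server with sudo systemctl restart vllm.service; view its logs with sudo journalctl -u vllm.service.
Tune throughput and memory with vLLM flags on the ExecStart line of vllm.service (--gpu-memory-utilization, --max-model-len, --max-num-seqs). After editing the unit, run sudo systemctl daemon-reload and restart the service.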
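
Key rotation can be scripted along the following lines. This is a sketch under assumptions: the `sk-` prefix simply mirrors the sample credentials file, the format used by the first-boot service is not documented here, and `new_api_key` and `rotate_key` are hypothetical helpers that operate on the text of /etc/vllm/vllm.env rather than on the file itself.

```python
import re
import secrets


def new_api_key(nbytes=32):
    """Return a random key; the 'sk-' prefix mirrors the credentials file."""
    return "sk-" + secrets.token_urlsafe(nbytes)


def rotate_key(env_text, key):
    """Replace the VLLM_API_KEY line in env-file text, or append one."""
    line = f"VLLM_API_KEY={key}"
    # A callable replacement avoids backslash escapes being interpreted.
    updated, count = re.subn(r"(?m)^VLLM_API_KEY=.*$", lambda m: line, env_text)
    if count == 0:
        # No existing entry: append one on its own line.
        prefix = env_text if env_text.endswith("\n") or not env_text else env_text + "\n"
        updated = prefix + line + "\n"
    return updated


env = "MODEL=Qwen/Qwen2.5-1.5B-Instruct\nVLLM_API_KEY=sk-old\n"
print(rotate_key(env, "sk-new"))
```

Write the result back as root and restart vllm.service for the new key to take effect. Whether /root/vllm-credentials.txt is refreshed automatically is not stated here, so update it by hand if you rely on it.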

Support

cloudimg provides 24/7 technical support for this image by email and chat, covering vLLM deployment, model selection, GPU sizing, throughput tuning, quantization, the OpenAI-compatible API, TLS termination and scaling. Contact details are on the AWS Marketplace listing.