vLLM preinstalled for AWS with NVIDIA GPU acceleration. The high-throughput OpenAI-compatible LLM inference and serving engine, on Ubuntu 24.04 behind an nginx reverse proxy, secured by a unique API key generated on first boot. Backed by 24/7 cloudimg support.
## vLLM by cloudimg
vLLM is a high-throughput, memory-efficient inference and serving engine for large language models. Its PagedAttention scheduler delivers state-of-the-art serving throughput and it exposes an OpenAI-compatible REST API, so existing OpenAI SDK code works unchanged. This Amazon Machine Image delivers vLLM fully installed as a system service on an NVIDIA Ampere GPU instance, so a private, self-hosted LLM inference endpoint is running within minutes of launch. The release available is vLLM 0.22.
## GPU Accelerated
The NVIDIA datacenter driver and CUDA toolkit are preinstalled and verified on real hardware, and vLLM serves the model on the GPU out of the box. Launch on a g5 (A10G) or g6 (L4) Ampere+ instance.
## Secure First Boot
vLLM's native API-key authentication is enabled, with a unique key generated for every instance on first boot and written to a root only file. No shared or default key ships in the image.
## Ready To Use
Call the OpenAI-compatible endpoints from any OpenAI SDK, LangChain or LlamaIndex. A small open-weights model is pre-downloaded; serve a different model by editing the service environment file.
## cloudimg Support
cloudimg provides 24/7 technical support for this image, covering deployment, model selection, GPU sizing, throughput tuning, the OpenAI-compatible API, TLS and scaling.