Infinity Embeddings Server on AWS User Guide
Overview
This image runs Infinity 0.0.77, the high-throughput, low-latency server for text embedding and reranking models (michaelfeil/infinity, MIT), on Ubuntu 24.04 LTS with NVIDIA GPU acceleration. Infinity serves an OpenAI-compatible /embeddings endpoint (plus /rerank and /classify), so existing OpenAI SDK code works once it points at the server's base URL and sends the Basic Authentication header. The NVIDIA driver is preinstalled and verified on real hardware during the build.
The server listens on the loopback address 127.0.0.1:7997 under an unprivileged infinity system account; nginx fronts it on port 80 with HTTP Basic Authentication. The public /health endpoint stays open; /embeddings, /rerank and /classify require the password. The default security group opens port 22 (SSH) and port 80 (HTTP) only, so 7997 is not reachable externally.
On the first boot of every deployed instance a one-shot service generates a fresh password, unique to that instance, and writes it to /root/infinity-credentials.txt (mode 0600, root only). The embedding model lives under HF_HOME=/var/lib/infinity/hf on a dedicated, independently resizable EBS data volume. A small open-weights model (BAAI/bge-base-en-v1.5, MIT) is pre-downloaded and served by default; you serve a different model by editing MODEL in /etc/infinity/infinity.env.
Prerequisites
- An AWS account subscribed to this product in AWS Marketplace.
- An EC2 key pair in your target region for SSH access.
- A security group allowing inbound TCP 22 (SSH) from your IP and TCP 80 (HTTP) from your users.
- An NVIDIA GPU instance. Recommended: g4dn.xlarge (NVIDIA T4); embedding models are light and run comfortably on a T4. Use g5 or g6 for higher throughput or larger models.
Connecting to your instance
| OS variant | Login user | Example |
|---|---|---|
| Ubuntu 24.04 | ubuntu | ssh -i your-key.pem ubuntu@<instance-public-ip> |
Step 1 - Launch from the AWS Marketplace console
- Open the product page in AWS Marketplace and choose Continue to Subscribe, then Continue to Configuration.
- Select the Infinity 0.0.77 on Ubuntu 24.04 delivery option and your region, then Continue to Launch.
- Choose a GPU instance type (g4dn.xlarge or larger), your VPC/subnet, key pair and the security group described above, and launch.
Step 2 - Launch from the AWS CLI
aws ec2 run-instances \
--image-id ami-xxxxxxxxxxxxxxxxx \
--instance-type g4dn.xlarge \
--key-name your-key \
--security-group-ids sg-xxxxxxxx \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=infinity}]'
Step 3 - Connect to your instance
ssh -i your-key.pem ubuntu@<instance-public-ip>
Step 4 - Confirm the GPU and services
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
systemctl is-active infinity.service nginx.service
curl -s http://127.0.0.1/health -w ' HTTP %{http_code}\n'
Expected output:
Tesla T4, 15360 MiB
active
active
{"unix":1781039921.9092817} HTTP 200
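After a boot or a service restart the model may still be loading, so a script that depends on the server can poll /health until it answers 200. The following is a minimal Python sketch of that idea using only the standard library. The function names, retry count and delay are illustrative assumptions, not part of the image.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str) -> int:
    """Return the HTTP status code for url, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (for example 502).
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar.
        return 0


def wait_until_healthy(url, attempts=30, delay_s=2.0,
                       probe=http_status, sleep=time.sleep):
    """Poll url until it returns 200; give up after `attempts` tries."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

On the instance itself, wait_until_healthy("http://127.0.0.1/health") returns True once nginx and Infinity are both answering. The probe and sleep parameters exist only so the loop can be exercised without a live server.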

Step 5 - Retrieve your password
sudo cat /root/infinity-credentials.txt
# Infinity Embeddings Server - generated on first boot by infinity-firstboot.service
INFINITY_URL=http://<instance-public-ip>/
INFINITY_MODEL=BAAI/bge-base-en-v1.5
INFINITY_USERNAME=admin
INFINITY_PASSWORD=<your-unique-password>
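The credentials file uses plain KEY=VALUE lines, so a script can read it directly instead of shelling out to grep and cut. The following is a minimal Python sketch, assuming the file keeps exactly the layout shown above. The function name parse_credentials and the sample values are illustrative only.

```python
def parse_credentials(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    creds = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so a password containing '=' survives
        # intact (the same behaviour as `cut -d= -f2-`).
        key, _, value = line.partition("=")
        creds[key.strip()] = value.strip()
    return creds


# Illustrative sample; the real file is /root/infinity-credentials.txt
# and is readable by root only.
SAMPLE = """\
# Infinity Embeddings Server - generated on first boot
INFINITY_URL=http://203.0.113.10/
INFINITY_MODEL=BAAI/bge-base-en-v1.5
INFINITY_USERNAME=admin
INFINITY_PASSWORD=example=password
"""

creds = parse_credentials(SAMPLE)
print(creds["INFINITY_USERNAME"], creds["INFINITY_MODEL"])
```

To read the real file, run the script as root and pass the contents of /root/infinity-credentials.txt to parse_credentials.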
Step 6 - List the served model
The /health endpoint is open; everything else requires the password (user admin).
PASS=$(sudo grep '^INFINITY_PASSWORD=' /root/infinity-credentials.txt | cut -d= -f2-)
curl -s -u "admin:$PASS" http://127.0.0.1/models | jq -c '.data[] | {id}'
{"id":"BAAI/bge-base-en-v1.5"}
Step 7 - Generate embeddings (OpenAI-compatible)
PASS=$(sudo grep '^INFINITY_PASSWORD=' /root/infinity-credentials.txt | cut -d= -f2-)
curl -s -u "admin:$PASS" http://127.0.0.1/embeddings -H 'Content-Type: application/json' \
-d '{"model":"BAAI/bge-base-en-v1.5","input":["the quick brown fox","retrieval augmented generation"]}' \
| jq -c '{model, count:(.data|length), dim:(.data[0].embedding|length)}'
{"model":"BAAI/bge-base-en-v1.5","count":2,"dim":768}
From your own machine, target the instance's public IP instead of 127.0.0.1.
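The same request can be made from Python without any third-party package. The sketch below mirrors the curl call above: it builds a POST to /embeddings with a JSON body and an HTTP Basic Authorization header. The function names are illustrative, and the response handling assumes only the data[].embedding shape that the jq filter above relies on.

```python
import base64
import json
import urllib.request


def build_embeddings_request(base_url, password, texts,
                             model="BAAI/bge-base-en-v1.5", username="admin"):
    """Build (but do not send) a POST request for the /embeddings endpoint."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    body = json.dumps({"model": model, "input": texts}).encode()
    return urllib.request.Request(
        base_url.rstrip("/") + "/embeddings",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


def embed(base_url, password, texts):
    """Send the request and return one embedding vector per input text."""
    req = build_embeddings_request(base_url, password, texts)
    with urllib.request.urlopen(req, timeout=30) as resp:
        payload = json.load(resp)
    return [item["embedding"] for item in payload["data"]]
```

For example, embed("http://<instance-public-ip>", "<your-unique-password>", ["hello world"]) should return a list holding one vector for the single input.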
Step 8 - Use the OpenAI SDK
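import base64

from openai import OpenAI

# nginx expects HTTP Basic auth, so send it as a default header on every request.
auth = base64.b64encode(b"admin:<your-unique-password>").decode()
client = OpenAI(
    base_url="http://<instance-public-ip>",
    api_key="unused",  # placeholder; the SDK requires a value
    default_headers={"Authorization": "Basic " + auth},
)
r = client.embeddings.create(model="BAAI/bge-base-en-v1.5", input=["hello world"])
print(len(r.data[0].embedding))  # embedding dimension (768 for the default model)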
Feed the vectors into a vector database such as Weaviate or Chroma to build a retrieval-augmented-generation pipeline.
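To illustrate what the retrieval step does with these vectors, here is a small in-memory sketch that ranks documents by cosine similarity to a query vector. It is a stand-in for a vector database rather than a replacement for one, and the function names are illustrative. The toy two-dimensional vectors stand in for the 768-dimensional embeddings the default model returns.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs, most similar document first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors for illustration only.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ranking = rank_by_similarity(query, docs)
print([index for index, _score in ranking])  # → [1, 2, 0]
```

A vector database performs the same kind of comparison, but with indexing that scales to large collections.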
Step 9 - Confirm GPU use
nvidia-smi --query-compute-apps=process_name,used_memory --format=csv,noheader
A Python process holding GPU memory confirms the model is served on the GPU:
/opt/infinity/venv/bin/python3, 1664 MiB
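If you want to check GPU use from a monitoring script, the CSV output above is easy to parse. The following is a minimal Python sketch, assuming the two-column "process_name, used_memory" layout shown above; the function name is illustrative. A memory value that is not a number is kept as None rather than guessed.

```python
def parse_compute_apps(output: str):
    """Parse 'process_name, N MiB' lines into (process_name, mib) tuples."""
    apps = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a process path containing commas survives.
        name, _, memory = line.rpartition(",")
        fields = memory.split()
        try:
            mib = int(fields[0]) if fields else None
        except ValueError:
            mib = None  # non-numeric memory field
        apps.append((name.strip(), mib))
    return apps


sample = "/opt/infinity/venv/bin/python3, 1664 MiB\n"
print(parse_compute_apps(sample))
# → [('/opt/infinity/venv/bin/python3', 1664)]
```

An empty result means no process is holding GPU memory, which suggests the model is not being served on the GPU.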
Serving a different model
Infinity serves the model named in its environment file. Edit and restart:
sudo sed -i 's#^MODEL=.*#MODEL=mixedbread-ai/mxbai-embed-large-v1#' /etc/infinity/infinity.env
sudo systemctl restart infinity.service
The model is downloaded into /var/lib/infinity/hf on first use. Infinity also serves reranking models (query the /rerank endpoint) and CLIP models.
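For automation that prefers Python over sed, the same edit can be sketched as a pure function over the file's text. This is an illustrative sketch: the function name set_model and the PORT line in the example are assumptions, and unlike the sed command it appends a MODEL line when none exists.

```python
import re


def set_model(env_text: str, model: str) -> str:
    """Return env_text with its MODEL= line replaced by the given model name."""
    pattern = re.compile(r"^MODEL=.*$", flags=re.MULTILINE)
    new_line = "MODEL=" + model
    if pattern.search(env_text):
        # A function replacement avoids backslash interpretation in the model name.
        return pattern.sub(lambda _match: new_line, env_text)
    # No MODEL line yet: append one, keeping the file newline-terminated.
    separator = "" if (not env_text or env_text.endswith("\n")) else "\n"
    return env_text + separator + new_line + "\n"


# Hypothetical file contents; only the MODEL key is documented for this image.
before = "PORT=7997\nMODEL=BAAI/bge-base-en-v1.5\n"
print(set_model(before, "mixedbread-ai/mxbai-embed-large-v1"))
```

Write the result back to /etc/infinity/infinity.env as root, then restart infinity.service as shown above.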
Enabling HTTPS
sudo apt-get update && sudo apt-get install -y certbot python3-certbot-nginx
sudo certbot --nginx -d your-domain.example.com
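certbot edits the nginx site at /etc/nginx/sites-available/cloudimg-infinity to add the TLS listener and arranges automatic renewal. Because the default security group opens only ports 22 and 80, also allow inbound TCP 443 so clients can reach the new listener.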
Backup and maintenance
- Model weights live under /var/lib/infinity/hf on a dedicated EBS volume. Snapshot that volume to back up downloaded models, or re-download by model name.
- The password is in the nginx htpasswd file /etc/nginx/.infinity.htpasswd; rotate it with: sudo htpasswd /etc/nginx/.infinity.htpasswd admin
- Restart with: sudo systemctl restart infinity.service
- View logs with: sudo journalctl -u infinity.service
Support
cloudimg provides 24/7 technical support for this image by email and chat, covering Infinity deployment, model selection, GPU sizing, batching and throughput tuning, the OpenAI-compatible API, TLS termination and scaling. Contact details are on the AWS Marketplace listing.