
Infinity Embeddings Server on AWS User Guide

Last updated: 2026-06-09 | Product: Infinity Embeddings Server on AWS

Overview

This image runs Infinity 0.0.77, a high-throughput, low-latency server for text embedding and reranking models (michaelfeil/infinity, MIT), on Ubuntu 24.04 LTS with NVIDIA GPU acceleration. Infinity serves an OpenAI-compatible /embeddings endpoint (plus /rerank and /classify), so existing OpenAI SDK code works once its base URL and credentials point at the instance. The NVIDIA driver is preinstalled and verified on real hardware during the build.

The server listens on the loopback address 127.0.0.1:7997 under an unprivileged infinity system account; nginx fronts it on port 80 with HTTP Basic Authentication. The public /health endpoint stays open; /embeddings, /rerank and /classify require the password. The default security group opens port 22 (SSH) and port 80 (HTTP) only, so 7997 is not reachable externally.

On the first boot of every deployed instance, a one-shot service generates a fresh password, unique to that instance, and writes it to /root/infinity-credentials.txt (mode 0600, root only). The embedding model lives under HF_HOME=/var/lib/infinity/hf on a dedicated, independently resizable EBS data volume. A small open-weights model (BAAI/bge-base-en-v1.5, MIT) is pre-downloaded and served by default; to serve a different model, edit MODEL in /etc/infinity/infinity.env and restart infinity.service.

Prerequisites

An AWS account subscribed to this product in AWS Marketplace.
An EC2 key pair in your target region for SSH access.
A security group allowing inbound TCP 22 (SSH) from your IP and TCP 80 (HTTP) from your users.
An NVIDIA GPU instance. Recommended: g4dn.xlarge (NVIDIA T4); embedding models are light and run comfortably on a T4. Use a g5 or g6 instance for higher throughput or larger models.

Connecting to your instance

OS variant	Login user	Example
Ubuntu 24.04	`ubuntu`	`ssh -i your-key.pem ubuntu@<instance-public-ip>`

Step 1 - Launch from the AWS Marketplace console

Open the product page in AWS Marketplace and choose Continue to Subscribe, then Continue to Configuration.
Select the Infinity 0.0.77 on Ubuntu 24.04 delivery option and your region, then Continue to Launch.
Choose a GPU instance type (g4dn.xlarge or larger), your VPC/subnet, key pair and the security group described above, and launch.

Step 2 - Launch from the AWS CLI (alternative to Step 1)

aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type g4dn.xlarge \
  --key-name your-key \
  --security-group-ids sg-xxxxxxxx \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=infinity}]'

Step 3 - Connect to your instance

ssh -i your-key.pem ubuntu@<instance-public-ip>

Step 4 - Confirm the GPU and services

nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
systemctl is-active infinity.service nginx.service
curl -s http://127.0.0.1/health -w ' HTTP %{http_code}\n'

Expected output:

Tesla T4, 15360 MiB
active
active
{"unix":1781039921.9092817} HTTP 200

[Screenshot: Infinity running on the cloudimg GPU AMI, with a Tesla T4 detected, embeddings served on the GPU, /health open and /embeddings gated by a per-instance password]
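For scripted monitoring, the /health body can be parsed rather than eyeballed. The sketch below assumes the response always has the shape shown in the expected output, a JSON object with a single unix field holding seconds since the epoch. The helper names parse_health and is_fresh are hypothetical and not part of Infinity.

```python
import json
from datetime import datetime, timezone


def parse_health(body: str) -> datetime:
    """Return the server timestamp from a /health response body as a UTC datetime."""
    payload = json.loads(body)
    # Assumption: the body looks like {"unix": <float seconds since epoch>}.
    return datetime.fromtimestamp(float(payload["unix"]), tz=timezone.utc)


def is_fresh(body: str, now: datetime, max_skew_seconds: float = 60.0) -> bool:
    """True if the reported timestamp is within max_skew_seconds of `now`."""
    skew = abs((now - parse_health(body)).total_seconds())
    return skew <= max_skew_seconds
```

Pass the text returned by curl -s http://127.0.0.1/health as body and datetime.now(timezone.utc) as now. A stale timestamp would suggest a cached or proxied response rather than a live server.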

Step 5 - Retrieve your password

sudo cat /root/infinity-credentials.txt

# Infinity Embeddings Server - generated on first boot by infinity-firstboot.service
INFINITY_URL=http://<instance-public-ip>/
INFINITY_MODEL=BAAI/bge-base-en-v1.5
INFINITY_USERNAME=admin
INFINITY_PASSWORD=<your-unique-password>
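If a script needs these values, the file can be read as simple KEY=VALUE lines. This is a minimal sketch that assumes the layout shown above (comment lines starting with #, one assignment per line). The function name parse_credentials is hypothetical.

```python
def parse_credentials(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    creds: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so a password containing '=' survives
        # intact (the same idea as `cut -d= -f2-`).
        key, value = line.split("=", 1)
        creds[key] = value
    return creds
```

The file is readable by root only, so a script using this helper would need to run as root or receive the file contents through sudo.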

Step 6 - List the served model

The /health endpoint is open; everything else requires the password (user admin).

PASS=$(sudo grep '^INFINITY_PASSWORD=' /root/infinity-credentials.txt | cut -d= -f2-)
curl -s -u "admin:$PASS" http://127.0.0.1/models | jq -c '.data[] | {id}'

{"id":"BAAI/bge-base-en-v1.5"}

Step 7 - Generate embeddings (OpenAI-compatible)

PASS=$(sudo grep '^INFINITY_PASSWORD=' /root/infinity-credentials.txt | cut -d= -f2-)
curl -s -u "admin:$PASS" http://127.0.0.1/embeddings -H 'Content-Type: application/json' \
  -d '{"model":"BAAI/bge-base-en-v1.5","input":["the quick brown fox","retrieval augmented generation"]}' \
  | jq -c '{model, count:(.data|length), dim:(.data[0].embedding|length)}'

{"model":"BAAI/bge-base-en-v1.5","count":2,"dim":768}

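From your own machine, target the instance's public IP instead of 127.0.0.1. Basic Authentication over plain HTTP sends the password unencrypted, so enable HTTPS (see Enabling HTTPS) before calling the server across an untrusted network.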
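The same request can be built with the Python standard library alone, which is handy where curl and jq are unavailable. This sketch mirrors the curl call in this step: a POST to /embeddings with a JSON body and an HTTP Basic header. The function name build_embeddings_request is hypothetical, and the request is constructed but not sent.

```python
import base64
import json
import urllib.request


def build_embeddings_request(base_url, password, texts,
                             model="BAAI/bge-base-en-v1.5", username="admin"):
    """Build (but do not send) a POST /embeddings request with Basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    body = json.dumps({"model": model, "input": list(texts)}).encode()
    return urllib.request.Request(
        base_url.rstrip("/") + "/embeddings",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )
```

To send it, pass the returned object to urllib.request.urlopen and decode the response with json.load; the vectors are under data[i].embedding, as in the jq filter above.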

Step 8 - Use the OpenAI SDK

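import base64

from openai import OpenAI

# nginx expects HTTP Basic credentials, so send them in the Authorization header.
auth = base64.b64encode(b"admin:<your-unique-password>").decode()
client = OpenAI(
    base_url="http://<instance-public-ip>",
    api_key="unused",  # placeholder; the Authorization header below replaces it
    default_headers={"Authorization": "Basic " + auth},
)
r = client.embeddings.create(model="BAAI/bge-base-en-v1.5", input=["hello world"])
print(len(r.data[0].embedding))  # embedding dimension (768 for bge-base-en-v1.5)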

Feed the vectors into a vector database such as Weaviate or Chroma to build a retrieval-augmented-generation pipeline.
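Before wiring in a vector database, the retrieval step can be prototyped in memory. The sketch below ranks document vectors by cosine similarity to a query vector. It is a brute-force stand-in for what a vector database does at scale, not the API of Weaviate or Chroma, and the names cosine_similarity and top_k are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return up to k (index, score) pairs, highest similarity first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: -pair[1])  # stable sort keeps input order on ties
    return scored[:k]
```

In a pipeline, doc_vecs would hold the embeddings of your document chunks and query_vec the embedding of the user's question, both produced by the same model.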

Step 9 - Confirm GPU use

nvidia-smi --query-compute-apps=process_name,used_memory --format=csv,noheader

A Python process holding GPU memory confirms the model is served on the GPU:

/opt/infinity/venv/bin/python3, 1664 MiB
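For an automated check, the CSV output above can be parsed and tested for an Infinity process holding GPU memory. This sketch assumes the two-column format requested in this step (process name, then memory in MiB) and that the process path contains /opt/infinity/, as in the sample output. The function names are hypothetical.

```python
def parse_compute_apps(csv_text):
    """Parse 'process_name, <N> MiB' lines into (name, mib) tuples."""
    apps = []
    for line in csv_text.splitlines():
        if "," not in line:
            continue
        # Split on the last comma so a path containing commas stays whole.
        name, mem = (part.strip() for part in line.rsplit(",", 1))
        fields = mem.split()
        if not fields or not fields[0].isdigit():
            continue  # skip rows where the memory column is not a number
        apps.append((name, int(fields[0])))
    return apps


def infinity_on_gpu(csv_text, marker="/opt/infinity/"):
    """True if a process whose path contains `marker` holds GPU memory."""
    return any(marker in name and mib > 0
               for name, mib in parse_compute_apps(csv_text))
```

An empty result right after a restart may only mean the model is still loading, so retry before treating it as a failure.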

Serving a different model

Infinity serves the model named in its environment file. Edit and restart:

sudo sed -i 's#^MODEL=.*#MODEL=mixedbread-ai/mxbai-embed-large-v1#' /etc/infinity/infinity.env
sudo systemctl restart infinity.service

The model is downloaded into /var/lib/infinity/hf on first use. Infinity also serves reranking models (query the /rerank endpoint) and CLIP models.
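If you manage the environment file from a provisioning script rather than with sed, the same substitution can be expressed in Python. This is a sketch of the text transformation only: it assumes the file holds a line beginning with MODEL=, and appends one if it does not. The function name set_model is hypothetical.

```python
import re


def set_model(env_text: str, model: str) -> str:
    """Return env_text with every MODEL=... line set to the given model."""
    line = f"MODEL={model}"
    # A function replacement avoids backslash escapes in `model` being interpreted.
    new_text, count = re.subn(r"^MODEL=.*$", lambda _m: line, env_text,
                              flags=re.MULTILINE)
    if count == 0:
        # No MODEL line present: append one, keeping a trailing newline.
        sep = "" if not env_text or env_text.endswith("\n") else "\n"
        new_text = env_text + sep + line + "\n"
    return new_text
```

Write the result back to /etc/infinity/infinity.env as root, then restart infinity.service as shown above.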

Enabling HTTPS

sudo apt-get update && sudo apt-get install -y certbot python3-certbot-nginx
sudo certbot --nginx -d your-domain.example.com

certbot edits the nginx site at /etc/nginx/sites-available/cloudimg-infinity to add the TLS listener and arranges automatic renewal.

Backup and maintenance

Model weights live under /var/lib/infinity/hf on a dedicated EBS volume. Snapshot that volume to back up downloaded models, or re-download by model name.
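The password hash is stored in the nginx htpasswd file /etc/nginx/.infinity.htpasswd; rotate it with sudo htpasswd /etc/nginx/.infinity.htpasswd admin, which prompts for the new password. htpasswd does not update /root/infinity-credentials.txt, so record the new password yourself.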
Restart with sudo systemctl restart infinity.service; logs: sudo journalctl -u infinity.service.
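When rotating the password, a randomly generated value is safer than a hand-typed one. The sketch below uses Python's secrets module to produce an alphanumeric password. The length and alphabet are assumptions chosen to avoid shell-quoting problems, not a requirement of this image, and the function name new_password is hypothetical.

```python
import secrets
import string


def new_password(length: int = 32) -> str:
    """Return a random alphanumeric password from a cryptographic RNG."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))
```

Print the result once, enter it at the htpasswd prompt from the rotation command above, and store it in your secrets manager.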

Support

cloudimg provides 24/7 technical support for this image by email and chat, covering Infinity deployment, model selection, GPU sizing, batching and throughput tuning, the OpenAI-compatible API, TLS termination and scaling. Contact details are on the AWS Marketplace listing.