Databases AWS

DuckDB on AWS User Guide

| Product: DuckDB on AWS

Overview

This image runs DuckDB, the open source in-process analytical database engine, on a single host. Because DuckDB is an embedded engine and not a server in the traditional sense, this Amazon Machine Image ships DuckDB inside a complete analytics container so you can connect, load data and run queries within minutes of launch. The release available is DuckDB 1.5.

The container consists of three parts:

  • The DuckDB command line client, installed at /usr/local/bin/duckdb, on every user's PATH. The CLI is the fastest way to drive DuckDB from an SSH session.
  • A Python virtual environment at /opt/duckdb-venv with the DuckDB Python client, pandas, PyArrow and JupyterLab installed, so notebooks can talk directly to the engine.
  • A JupyterLab notebook server that runs as a systemd unit and listens on 127.0.0.1:8888, fronted by nginx on port 80 with HTTP basic authentication. JupyterLab is where you write and run analytical notebooks against your data.

A one million row New York City yellow taxi trips parquet file is bundled at /opt/duckdb/samples/nyc_yellow_taxi_2024_01.parquet, alongside a starter notebook 01-duckdb-quickstart.ipynb. The notebook opens a persistent DuckDB database at /opt/duckdb/samples/main.duckdb and runs three analytical queries so you can see the engine in action before writing any code.

The JupyterLab basic-authentication password is generated on the first boot of every deployed instance. Two instances launched from the same Amazon Machine Image never share a password. The image generates a fresh password on first boot, writes it into the nginx htpasswd store, and stores the plain text value in /stage/scripts/duckdb-credentials.log with mode 0600 so that only the root user can read it.

DuckDB databases, notebooks and sample datasets are kept on a dedicated EBS data volume mounted at /opt/duckdb, separate from the operating system disk, so the analytics tier can be resized independently of the root volume.

Prerequisites

Before you deploy this image you need:

  • An Amazon Web Services account where you can launch EC2 instances
  • IAM permissions to launch instances, create security groups, and subscribe to AWS Marketplace products
  • An EC2 key pair in the target Region for SSH access to the instance
  • A VPC and subnet in the target Region, with a security group allowing inbound port 22 from your management network and inbound port 80 from the trusted networks that will reach the JupyterLab interface
  • The AWS CLI (version 2) installed locally if you plan to deploy from the command line

Step 1: Launch the Instance from the AWS Marketplace

Sign in to the AWS Management Console, open the EC2 service, and select Launch instance. Under Application and OS Images choose AWS Marketplace AMIs and search for DuckDB. Select the cloudimg listing and choose Select, then Continue on the subscription summary.

Pick an instance type of m5.large or larger. DuckDB benefits from the memory-to-vCPU ratio of the m-class instance families because analytical queries hold working sets in memory. Choose your EC2 key pair under Key pair (login). Under Network settings select your VPC and subnet, and either create or select a security group that allows inbound port 22 from your management network and inbound port 80 from the trusted networks that will reach JupyterLab. Do not open port 80 to the public internet, because JupyterLab executes arbitrary Python and the basic-authentication password is the only barrier. Leave the root volume at the default size or larger.

Select Launch instance. First boot initialisation takes approximately one minute after the instance state becomes Running and the status checks pass.

Step 2: Launch the Instance from the AWS CLI

The following block launches an instance from the cloudimg DuckDB Marketplace AMI into an existing subnet and security group. Replace <ami-id> with the AMI ID shown on the Marketplace listing, <key-name> with your EC2 key pair name, <subnet-id> with your subnet ID, and <security-group-id> with a security group that opens ports 22 and 80 as described above.

aws ec2 run-instances \
  --image-id <ami-id> \
  --instance-type m5.large \
  --key-name <key-name> \
  --subnet-id <subnet-id> \
  --security-group-ids <security-group-id> \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":30,"VolumeType":"gp3"}}]' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=duckdb-01}]'

The command prints a JSON document on success. Note the instance ID, then retrieve its public address once it is running with aws ec2 describe-instances --instance-ids <instance-id> --query "Reservations[].Instances[].PublicIpAddress" --output text.

Step 3: Connect and Retrieve the JupyterLab Password

Connect over SSH with the key pair you selected and the public IP address from step 2. The SSH login user depends on the operating system of the AMI variant you launched:

AMI variant SSH login user
DuckDB 1.5 on Ubuntu 24.04 ubuntu

The first boot service runs before the SSH daemon becomes ready, so the credentials file is always in place when you log in for the first time.

ssh <login-user>@<public-ip>
sudo cat /stage/scripts/duckdb-credentials.log

You will see a plain text file containing the JupyterLab URL, the basic-authentication username (duckdb), and the basic-authentication password. Copy these values somewhere secure such as a password manager or an encrypted vault, and do not commit them to source control.

Each command block in this guide that talks to JupyterLab through nginx begins by reading the password from the credentials file into a PASSWORD shell variable, so every block is self contained:

PASSWORD="$(sudo awk -F= '/^DUCKDB_UI_PASSWORD=/ {print $2}' /stage/scripts/duckdb-credentials.log)"
echo "duckdb basic-auth password length: ${#PASSWORD}"

Step 4: Verify the Server is Healthy

The image ships an unauthenticated health endpoint at /healthz on port 80, served by nginx itself. The JupyterLab application is served behind basic authentication at /:

curl -s -o /dev/null -w "healthz: %{http_code}\n" http://127.0.0.1/healthz
curl -s -o /dev/null -w "no auth: %{http_code}\n" http://127.0.0.1/
PASSWORD="$(sudo awk -F= '/^DUCKDB_UI_PASSWORD=/ {print $2}' /stage/scripts/duckdb-credentials.log)"
curl -s -L -o /dev/null -w "with auth: %{http_code}\n" -u "duckdb:${PASSWORD}" http://127.0.0.1/

The expected output is:

healthz: 200
no auth: 401
with auth: 200

The 401 response with no credentials confirms that nginx is enforcing basic authentication on the JupyterLab path. The 200 with the correct credentials confirms that the request reaches JupyterLab on 127.0.0.1:8888 through the reverse proxy.

Step 5: Open the JupyterLab Interface

Browse to http://<public-ip>/ in your web browser. Your browser will prompt for HTTP basic authentication. Enter duckdb as the username and the password you retrieved in step 3.

JupyterLab Launcher landing page

The first screen is the JupyterLab launcher. The left hand file browser shows the bundled starter notebook 01-duckdb-quickstart.ipynb. Double click the notebook to open it.

Starter notebook open with DuckDB code cells

The notebook contains a single connection cell followed by three analytical query cells, each working against the bundled parquet sample.

Step 6: Run the Starter Notebook

From the menu select Run then Run All Cells. The notebook executes the connection cell, prints v1.5.2, then runs the three queries:

  • Query 1 — count the rows in the parquet file (the bundled January 2024 yellow-taxi sample has 2,964,624 rows).
  • Query 2 — group by passenger count and compute the average fare and trip distance.
  • Query 3 — group by hour of day and compute the trip volume.

DuckDB analytical query results in JupyterLab

Query 2 shows that 1-passenger trips dominate the dataset (around 2.2 million of the 2.96 million trips), with an average fare close to $17.50 and an average trip distance close to 3.1 miles. The first query result confirms that DuckDB scans the entire parquet file in under one second on m5.large.

Step 7: Use the DuckDB CLI

The DuckDB command line client is on the PATH for every Linux user, so you can run analytical queries from an SSH session without opening JupyterLab.

duckdb --version
v1.5.2 (Variegata) 8a5851971f

Open a session against the persistent database that the notebook uses. The CLI shows a prompt where you can run SQL directly:

duckdb /opt/duckdb/samples/main.duckdb

Inside the CLI, run a sample analytical query:

SELECT COUNT(*) FROM '/opt/duckdb/samples/nyc_yellow_taxi_2024_01.parquet';

Or pass a SQL statement non-interactively with -c:

duckdb -c "SELECT COUNT(*) AS trips FROM '/opt/duckdb/samples/nyc_yellow_taxi_2024_01.parquet';"
┌─────────┐
│  trips  │
│  int64  │
├─────────┤
│ 2964624 │
└─────────┘

DuckDB exposes a rich SQL dialect, window functions, JSON support and the read_parquet, read_csv and read_json table functions for querying files directly without loading them.

Step 8: Server Components

Component Version Source
Operating system Ubuntu 24.04 LTS AWS Ubuntu base AMI, fully patched on build
DuckDB CLI 1.5.2 github.com/duckdb/duckdb releases (SHA256 pinned)
DuckDB Python client 1.5.2 PyPI, installed in /opt/duckdb-venv
Python 3.12 Ubuntu noble distribution package
JupyterLab 4.5 PyPI, installed in /opt/duckdb-venv
pandas 2.3 PyPI
PyArrow 24 PyPI
nginx 1.24 Ubuntu noble distribution package

The Python virtual environment lives at /opt/duckdb-venv and is owned by root. The duckdb system user (created with no shell) is what runs the JupyterLab service. Notebooks, sample data and persistent databases live on the dedicated EBS data volume at /opt/duckdb.

Step 9: Query Files on Amazon S3 with the httpfs Extension

DuckDB can query CSV, parquet and JSON files directly on Amazon S3 without copying them locally. The image pre-installs the httpfs extension into the persistent database so that the customer first boot has zero outbound dependencies.

To query a parquet file in your own S3 bucket, install and load the extension (idempotent), then point a SELECT at an s3:// URL:

LOAD httpfs;
CREATE SECRET (
  TYPE S3,
  PROVIDER credential_chain
);
SELECT COUNT(*) FROM 's3://my-bucket/path/to/file.parquet';

If the EC2 instance is launched with an IAM instance profile that grants s3:GetObject on the target prefix, the credential_chain provider picks up the temporary credentials from the EC2 metadata service. For long-lived static credentials, use the KEY and SECRET parameters of CREATE SECRET.

Step 10: Persist Custom Notebooks

JupyterLab serves notebooks from /opt/duckdb/notebooks, which lives on the dedicated EBS data volume. Notebooks you save from the UI land in that directory and survive instance restarts. If you create a samples/ subdirectory for parquet files or CSV exports, those also live on the data volume.

The starter notebook is mirrored in two places: the original at /opt/duckdb/samples/01-duckdb-quickstart.ipynb, and an editable copy at /opt/duckdb/notebooks/01-duckdb-quickstart.ipynb. Edits to the copy do not change the original.

Step 11: Restart and Restart-Safe State

DuckDB persistent databases are written to disk at every CHECKPOINT and on a clean shutdown. The JupyterLab systemd unit takes a clean SIGTERM on stop, so a normal systemctl stop jupyterlab.service or instance reboot does not corrupt open databases.

Check service status:

sudo systemctl status jupyterlab.service nginx.service --no-pager

Restart the JupyterLab service (for example, after editing a Python dependency):

sudo systemctl restart jupyterlab.service

Step 12: Back Up the Data Volume

The DuckDB data tier lives on the EBS volume mounted at /opt/duckdb. The simplest backup strategy is an EBS snapshot of that volume, run from your management environment with the AWS CLI:

aws ec2 create-snapshot \
  --volume-id <volume-id> \
  --description "DuckDB data backup $(date -u +%Y%m%d)" \
  --tag-specifications 'ResourceType=snapshot,Tags=[{Key=Product,Value=duckdb-aws},{Key=Type,Value=data}]'

Snapshots are incremental, so subsequent backups only store the changed blocks. A snapshot lifecycle policy under Data Lifecycle Manager in the EC2 console automates the rotation. For per-database backups, use DuckDB's EXPORT DATABASE statement to write a parquet directory you can copy to S3.

Step 13: HTTPS with a Reverse Proxy

The image serves JupyterLab on plain HTTP on port 80, because the choice of TLS termination depends on your environment. The recommended pattern is to put an AWS Application Load Balancer in front of the instance and let it terminate TLS with an ACM certificate; the ALB target group forwards plain HTTP to port 80 on the instance.

An alternative is a sidecar reverse proxy on the same host. The image does not include certbot, but you can install it manually and run it against a domain you own. Replace <your-domain> with your real domain:

sudo apt-get update
sudo apt-get install -y certbot python3-certbot-nginx
sudo certbot --nginx -d <your-domain> --redirect --non-interactive --agree-tos -m <your-email>

When certbot rewrites the nginx configuration it preserves the basic-authentication block, so the JupyterLab interface remains password protected over HTTPS as well.

Step 14: Rotate the JupyterLab Password

The per-instance password is generated on first boot. To rotate it later, regenerate the htpasswd entry and update the credentials file:

NEW_PASSWORD="$(openssl rand -hex 16)"
sudo htpasswd -bc /etc/nginx/.duckdb-htpasswd duckdb "${NEW_PASSWORD}"
sudo chown root:www-data /etc/nginx/.duckdb-htpasswd
sudo chmod 0640 /etc/nginx/.duckdb-htpasswd
sudo systemctl reload nginx
echo "New JupyterLab password: ${NEW_PASSWORD}"

Update /stage/scripts/duckdb-credentials.log if you want the canonical record to match.

Step 15: Operating System Updates

The image ships with Ubuntu unattended-upgrades enabled, so security updates land automatically. To apply outstanding updates immediately:

sudo apt-get update
sudo apt-get -y upgrade

A kernel update may require a reboot. Schedule any reboots through your normal change-control process.

Security Considerations

  • Restrict port 80 to trusted networks. JupyterLab executes arbitrary Python; the basic-authentication password is the only barrier. Open the port only to your office, VPN, or VPC private subnets.
  • Use an IAM instance role for AWS access. If your notebooks talk to S3, DynamoDB or other AWS services, attach an instance profile rather than baking long-lived access keys into the image.
  • The duckdb system user has no shell. It cannot be SSHed into and runs only the JupyterLab process. Do not loosen its permissions.
  • Treat the credentials file as a secret. /stage/scripts/duckdb-credentials.log is mode 0600, root-owned. Do not copy it into customer-readable locations.
  • Plan for HTTPS. Plain HTTP on port 80 is acceptable inside a private VPC but not over the public internet. Use the ALB pattern in step 13 for production.

Screenshots

JupyterLab launcher landing

The JupyterLab launcher landing page after first boot, with the bundled starter notebook visible in the file browser.

Starter notebook open

The bundled DuckDB quickstart notebook open in JupyterLab, ready to run analytical queries against the sample parquet dataset.

Analytical query result

A DuckDB analytical query against the New York City taxi parquet sample (2.96 million rows), executed from the notebook.


Support

cloudimg provides 24/7 technical support for this image, covering DuckDB deployment, notebook configuration, dataset loading, performance tuning and engine upgrades.

All product and company names are trademarks or registered trademarks of their respective holders. Use of them does not imply any affiliation with or endorsement by them.