Databases

DuckDB User Guide

| Product: DuckDB

Overview

This guide covers the deployment and use of DuckDB on Linux using cloudimg AMIs from the AWS Marketplace. DuckDB is a fast, in process analytical database management system designed for running complex analytical queries on large datasets. It is optimized for Online Analytical Processing (OLAP) workloads and can directly query CSV, Parquet, and JSON files.

What's included in this AMI:

  • DuckDB executable at /apps/duckdb
  • Ready for analytical workloads and data processing
  • OS package update script for keeping the system current
  • AWS CLI v2 for AWS service integration
  • Systems Manager Agent (SSM) for remote management
  • CloudWatch Agent for monitoring
  • Latest security patches applied at build time
  • 24/7 cloudimg support with guaranteed 24 hour response SLA

Prerequisites

Before launching this AMI, ensure you have:

  1. An active AWS account
  2. An active subscription to the DuckDB listing on AWS Marketplace
  3. An EC2 key pair for SSH access
  4. Familiarity with EC2 instance management and SSH

Recommended Instance Type: t3.large (2 vCPU, 8 GB RAM) or larger for analytical workloads. The minimum requirements are 1 vCPU, 1 GB RAM, and 20 GB disk space. DuckDB performance scales well with additional memory and CPU.

Step 1: Launch the AMI

  1. Navigate to the AWS Marketplace and search for "DuckDB cloudimg"
  2. Click Continue to Subscribe, accept the terms, then Continue to Configuration
  3. Select your preferred Region and Software Version
  4. Click Continue to Launch
  5. Choose Launch through EC2 for full control over instance configuration
  6. Select your instance type (t3.large recommended)
  7. Configure storage: 20 GB gp3 minimum, increase for large datasets
  8. Configure your Security Group with the following inbound rules:
Port Protocol Source Purpose
22 TCP Your IP SSH access
  1. Select your EC2 key pair and launch the instance

Step 2: Connect via SSH

Once your instance is running and has passed both status checks (2/2), connect using SSH:

ssh -i your-key.pem ec2-user@<public-ip-address>

To switch to the root user:

sudo su -

Step 3: Launch DuckDB

The DuckDB executable is located at /apps/duckdb/duckdb. Launch the interactive CLI:

/apps/duckdb/duckdb

You will see the DuckDB interactive prompt where you can run SQL queries directly.

Create an in memory database (default):

SELECT 'Hello from DuckDB on AWS!' AS message;

Create a persistent database:

/apps/duckdb/duckdb /apps/duckdb/mydata.db

Working with DuckDB

Create a table and insert data:

CREATE TABLE sales (
    id INTEGER,
    product VARCHAR,
    amount DECIMAL(10,2),
    sale_date DATE
);

INSERT INTO sales VALUES
    (1, 'Widget A', 29.99, '2024-01-15'),
    (2, 'Widget B', 49.99, '2024-01-16'),
    (3, 'Widget A', 29.99, '2024-01-17');

SELECT product, SUM(amount) as total_revenue, COUNT(*) as num_sales
FROM sales
GROUP BY product
ORDER BY total_revenue DESC;

Query CSV files directly:

SELECT * FROM read_csv_auto('/path/to/data.csv');

Query Parquet files:

SELECT * FROM read_parquet('/path/to/data.parquet');

Query JSON files:

SELECT * FROM read_json_auto('/path/to/data.json');

Query data directly from S3:

SELECT * FROM read_parquet('s3://your-bucket/data.parquet');

Export query results to CSV:

COPY (SELECT * FROM sales) TO '/tmp/sales_export.csv' (HEADER, DELIMITER ',');

Export to Parquet:

COPY (SELECT * FROM sales) TO '/tmp/sales_export.parquet' (FORMAT PARQUET);

Server Components

Component Install Path
DuckDB /apps/duckdb

Note: Component versions may be updated on first boot by the automatic OS package update script.

Filesystem Layout

Mount Point Size Description
/ 38 GB Root filesystem
/boot 2 GB Operating system kernel files
/apps 9.8 GB DuckDB installation directory

Key directories:

Directory Purpose
/apps/duckdb DuckDB installation and executable
/apps/duckdb/duckdb DuckDB binary executable

Scripts and Log Files

Script/Log Path Description
initial_boot_update.sh /stage/scripts Updates the OS with the latest packages on first boot
initial_boot_update.log /stage/scripts Output log for the boot update script

On Startup

An OS package update script runs on first boot to ensure the image is fully up to date. You can disable this by removing the script and its crontab entry:

rm -f /stage/scripts/initial_boot_update.sh

crontab -e
# Delete the following line, save and exit:
@reboot /stage/scripts/initial_boot_update.sh

Troubleshooting

DuckDB command not found

  1. Use the full path: /apps/duckdb/duckdb
  2. Or add it to your PATH: export PATH=$PATH:/apps/duckdb
  3. Verify the binary exists: ls -la /apps/duckdb/duckdb

Out of memory when querying large datasets

  1. DuckDB processes data in memory; upgrade to a larger instance type
  2. Use PRAGMA memory_limit='4GB'; to set explicit memory limits
  3. Process data in smaller batches using LIMIT and OFFSET

Cannot read S3 files

  1. Configure AWS credentials: ensure the instance has an IAM role with S3 read access
  2. Or set credentials in DuckDB: sql SET s3_region='us-east-1'; SET s3_access_key_id='your-key'; SET s3_secret_access_key='your-secret';

Permission denied on database file

  1. Check file ownership and permissions
  2. Ensure the user has write access to the directory
  3. Try creating the database in a different location: /tmp/test.db

Security Recommendations

  • Restrict SSH access: Only allow port 22 from trusted IP addresses
  • Use IAM roles for S3 access: Avoid storing AWS credentials on the instance
  • Protect database files: Set appropriate file permissions on persistent database files
  • Keep DuckDB updated: Check for new versions with security fixes
  • Enable Windows Update: Keep the OS patched with the latest security updates
  • Back up data: Regularly back up database files to S3

Support

If you encounter any issues with this product, contact cloudimg support:

  • Email: support@cloudimg.co.uk
  • Website: www.cloudimg.co.uk
  • Support hours: 24/7 with guaranteed 24 hour response SLA