DuckDB User Guide
Overview
This guide covers the deployment and use of DuckDB on Linux using cloudimg AMIs from the AWS Marketplace. DuckDB is a fast, in process analytical database management system designed for running complex analytical queries on large datasets. It is optimized for Online Analytical Processing (OLAP) workloads and can directly query CSV, Parquet, and JSON files.
What's included in this AMI:
- DuckDB executable at /apps/duckdb
- Ready for analytical workloads and data processing
- OS package update script for keeping the system current
- AWS CLI v2 for AWS service integration
- Systems Manager Agent (SSM) for remote management
- CloudWatch Agent for monitoring
- Latest security patches applied at build time
- 24/7 cloudimg support with guaranteed 24 hour response SLA
Prerequisites
Before launching this AMI, ensure you have:
- An active AWS account
- An active subscription to the DuckDB listing on AWS Marketplace
- An EC2 key pair for SSH access
- Familiarity with EC2 instance management and SSH
Recommended Instance Type: t3.large (2 vCPU, 8 GB RAM) or larger for analytical workloads. The minimum requirements are 1 vCPU, 1 GB RAM, and 20 GB disk space. DuckDB performance scales well with additional memory and CPU.
Step 1: Launch the AMI
- Navigate to the AWS Marketplace and search for "DuckDB cloudimg"
- Click Continue to Subscribe, accept the terms, then Continue to Configuration
- Select your preferred Region and Software Version
- Click Continue to Launch
- Choose Launch through EC2 for full control over instance configuration
- Select your instance type (
t3.largerecommended) - Configure storage: 20 GB gp3 minimum, increase for large datasets
- Configure your Security Group with the following inbound rules:
| Port | Protocol | Source | Purpose |
|---|---|---|---|
| 22 | TCP | Your IP | SSH access |
- Select your EC2 key pair and launch the instance
Step 2: Connect via SSH
Once your instance is running and has passed both status checks (2/2), connect using SSH:
ssh -i your-key.pem ec2-user@<public-ip-address>
To switch to the root user:
sudo su -
Step 3: Launch DuckDB
The DuckDB executable is located at /apps/duckdb/duckdb. Launch the interactive CLI:
/apps/duckdb/duckdb
You will see the DuckDB interactive prompt where you can run SQL queries directly.
Create an in memory database (default):
SELECT 'Hello from DuckDB on AWS!' AS message;
Create a persistent database:
/apps/duckdb/duckdb /apps/duckdb/mydata.db
Working with DuckDB
Create a table and insert data:
CREATE TABLE sales (
id INTEGER,
product VARCHAR,
amount DECIMAL(10,2),
sale_date DATE
);
INSERT INTO sales VALUES
(1, 'Widget A', 29.99, '2024-01-15'),
(2, 'Widget B', 49.99, '2024-01-16'),
(3, 'Widget A', 29.99, '2024-01-17');
SELECT product, SUM(amount) as total_revenue, COUNT(*) as num_sales
FROM sales
GROUP BY product
ORDER BY total_revenue DESC;
Query CSV files directly:
SELECT * FROM read_csv_auto('/path/to/data.csv');
Query Parquet files:
SELECT * FROM read_parquet('/path/to/data.parquet');
Query JSON files:
SELECT * FROM read_json_auto('/path/to/data.json');
Query data directly from S3:
SELECT * FROM read_parquet('s3://your-bucket/data.parquet');
Export query results to CSV:
COPY (SELECT * FROM sales) TO '/tmp/sales_export.csv' (HEADER, DELIMITER ',');
Export to Parquet:
COPY (SELECT * FROM sales) TO '/tmp/sales_export.parquet' (FORMAT PARQUET);
Server Components
| Component | Install Path |
|---|---|
| DuckDB | /apps/duckdb |
Note: Component versions may be updated on first boot by the automatic OS package update script.
Filesystem Layout
| Mount Point | Size | Description |
|---|---|---|
| / | 38 GB | Root filesystem |
| /boot | 2 GB | Operating system kernel files |
| /apps | 9.8 GB | DuckDB installation directory |
Key directories:
| Directory | Purpose |
|---|---|
| /apps/duckdb | DuckDB installation and executable |
| /apps/duckdb/duckdb | DuckDB binary executable |
Scripts and Log Files
| Script/Log | Path | Description |
|---|---|---|
| initial_boot_update.sh | /stage/scripts | Updates the OS with the latest packages on first boot |
| initial_boot_update.log | /stage/scripts | Output log for the boot update script |
On Startup
An OS package update script runs on first boot to ensure the image is fully up to date. You can disable this by removing the script and its crontab entry:
rm -f /stage/scripts/initial_boot_update.sh
crontab -e
# Delete the following line, save and exit:
@reboot /stage/scripts/initial_boot_update.sh
Troubleshooting
DuckDB command not found
- Use the full path:
/apps/duckdb/duckdb - Or add it to your PATH:
export PATH=$PATH:/apps/duckdb - Verify the binary exists:
ls -la /apps/duckdb/duckdb
Out of memory when querying large datasets
- DuckDB processes data in memory; upgrade to a larger instance type
- Use
PRAGMA memory_limit='4GB';to set explicit memory limits - Process data in smaller batches using
LIMITandOFFSET
Cannot read S3 files
- Configure AWS credentials: ensure the instance has an IAM role with S3 read access
- Or set credentials in DuckDB:
sql SET s3_region='us-east-1'; SET s3_access_key_id='your-key'; SET s3_secret_access_key='your-secret';
Permission denied on database file
- Check file ownership and permissions
- Ensure the user has write access to the directory
- Try creating the database in a different location:
/tmp/test.db
Security Recommendations
- Restrict SSH access: Only allow port 22 from trusted IP addresses
- Use IAM roles for S3 access: Avoid storing AWS credentials on the instance
- Protect database files: Set appropriate file permissions on persistent database files
- Keep DuckDB updated: Check for new versions with security fixes
- Enable Windows Update: Keep the OS patched with the latest security updates
- Back up data: Regularly back up database files to S3
Support
If you encounter any issues with this product, contact cloudimg support:
- Email: support@cloudimg.co.uk
- Website: www.cloudimg.co.uk
- Support hours: 24/7 with guaranteed 24 hour response SLA