Apache Flink on AWS User Guide

Product: Apache Flink on AWS

Overview

Apache Flink is an open source framework and distributed processing engine for stateful computations over unbounded and bounded data streams. This image runs Flink as a single node standalone cluster: a JobManager and a TaskManager run together on one instance, each as its own systemd service, ready for development, testing, single tenant production streaming, and proof of concept workloads.

The Flink distribution is installed at /opt/flink (a symlink to the versioned /opt/flink-2.2.1 directory). A dedicated unprivileged flink system user owns the install and runs both services. Flink's runtime working directories (the checkpoint and savepoint directories, the web upload directory, the temporary directories, and the logs) live on a separate EBS data volume mounted at /opt/flink-data, so that the working set can be resized independently of the operating system disk.

The JobManager Web Dashboard, which is also the Flink REST API, listens on port 8081. Apache Flink has no built in user accounts or password authentication, so there is no per instance secret to retrieve. On the first boot of every deployed instance a one shot service records the dashboard address and the cluster paths to an information file at /stage/scripts/flink-info.log and confirms the cluster is up. Because Flink itself is unauthenticated, the security responsibility is yours: restrict the port 8081 security group rule to trusted networks, or place the dashboard behind a TLS terminating load balancer.

Prerequisites

Before you deploy this image you need:

  • An Amazon Web Services account where you can launch EC2 instances
  • IAM permissions to launch instances, create security groups, and subscribe to AWS Marketplace products
  • An EC2 key pair in the target Region for SSH access to the instance
  • A VPC and subnet in the target Region, with a security group allowing inbound port 22 from your management network and inbound port 8081 from the trusted networks that need the Flink Web Dashboard
  • The AWS CLI (version 2) installed locally if you plan to deploy from the command line

Step 1: Launch the Instance from the AWS Marketplace

Sign in to the AWS Management Console, open the EC2 service, and select Launch instance. Under Application and OS Images choose AWS Marketplace AMIs and search for Apache Flink. Select the cloudimg listing and choose Select, then Continue on the subscription summary.

Pick an instance type of m5.large or larger — stream processing is memory and CPU sensitive. Choose your EC2 key pair under Key pair (login). Under Network settings select your VPC and subnet, and either create or select a security group that allows inbound port 22 from your management network and inbound port 8081 from the trusted networks that need the Web Dashboard. Leave the root volume at the default size or larger.

Select Launch instance. First boot initialisation takes well under a minute after the instance state becomes Running and the status checks pass.

Step 2: Launch the Instance from the AWS CLI

The following block launches an instance from the cloudimg Apache Flink Marketplace AMI into an existing subnet and security group. Replace <ami-id> with the AMI ID shown on the Marketplace listing, <key-name> with your EC2 key pair name, <subnet-id> with your subnet ID, and <security-group-id> with a security group that opens ports 22 and 8081 as described above.

aws ec2 run-instances \
  --image-id <ami-id> \
  --instance-type m5.large \
  --key-name <key-name> \
  --subnet-id <subnet-id> \
  --security-group-ids <security-group-id> \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":30,"VolumeType":"gp3"}}]' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=apache-flink-01}]'

The command prints a JSON document on success. Note the instance ID, then retrieve its public address once it is running:

aws ec2 describe-instances --instance-ids <instance-id> --query "Reservations[].Instances[].PublicIpAddress" --output text
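If you script the launch, the wait and the address lookup can be combined. The sketch below is one possible wrapper, not part of the image: the function name wait_for_public_ip is illustrative, and it assumes the AWS CLI is configured for the Region you launched into. It relies on the aws ec2 wait instance-running and aws ec2 wait instance-status-ok subcommands.

```shell
# Sketch: block until the instance is running and has passed its status
# checks, then print its public IP address. The function name is illustrative.
wait_for_public_ip() {
  local instance_id=$1
  aws ec2 wait instance-running --instance-ids "$instance_id" || return 1
  aws ec2 wait instance-status-ok --instance-ids "$instance_id" || return 1
  aws ec2 describe-instances \
    --instance-ids "$instance_id" \
    --query "Reservations[].Instances[].PublicIpAddress" \
    --output text
}
```

Call it as wait_for_public_ip <instance-id>. The output is empty if the instance has no public address, for example in a private subnet.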

Step 3: Connect and Review the Cluster Information File

Connect over SSH with the key pair you selected and the instance's public IP address, which is shown in the EC2 console and returned by the describe-instances command in Step 2. The SSH login user depends on the operating system of the AMI variant you launched:

AMI variant | SSH login user
Apache Flink 2 on Ubuntu 24.04 | ubuntu

The first boot service runs before the SSH daemon becomes ready, so the cluster information file is always in place when you log in for the first time.

sudo cat /stage/scripts/flink-info.log

The file is plain text. It records the dashboard URL, the REST API endpoint, the Flink home directory, the data volume mount point, and the names of the two systemd services. Apache Flink has no built in passwords, so this file holds no secret — it is an operator reference.

Step 4: Check the Flink Services and Version

The JobManager and the TaskManager each run as a systemd service. Confirm both are active:

sudo systemctl is-active flink-jobmanager.service flink-taskmanager.service

Both lines report active. Check the installed Flink release with the command line client:

cd /tmp && sudo -u flink /opt/flink/bin/flink --version

The Java runtime that Flink runs on is OpenJDK 17:

java -version
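For monitoring scripts, the two-service check can be wrapped so that it returns a single exit status. This is a minimal sketch: the function name flink_services_ok is illustrative, and the unit names are the ones given in this step.

```shell
# Sketch: report each Flink unit's state and fail if either is not active.
flink_services_ok() {
  local unit state rc=0
  for unit in flink-jobmanager.service flink-taskmanager.service; do
    # is-active prints the state and exits non-zero when it is not "active".
    state=$(systemctl is-active "$unit" 2>/dev/null)
    printf '%s: %s\n' "$unit" "${state:-unknown}"
    [ "$state" = "active" ] || rc=1
  done
  return "$rc"
}
```

A caller can then write flink_services_ok || echo "Flink is not fully up" without parsing the output.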

Step 5: REST API Health

The JobManager REST API is the programmatic face of the cluster. The overview endpoint reports the cluster health in one call:

curl -fsS http://127.0.0.1:8081/overview

The JSON response returns flink-version, the count of registered TaskManagers, the total and available task slots, and counters for running, finished, cancelled, and failed jobs. The registered TaskManagers are listed in detail at the /taskmanagers endpoint:

curl -fsS http://127.0.0.1:8081/taskmanagers | head -c 320
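The overview response can also drive a simple readiness check. The sketch below assumes the numeric fields are named taskmanagers and slots-available, as recent Flink releases return them; confirm the names against your own /overview output. The function names are illustrative, and the parsing uses grep rather than a JSON tool so that it needs nothing beyond the base system.

```shell
# Sketch: read one numeric field from the /overview JSON on stdin.
overview_field() {
  grep -oE "\"$1\":[0-9]+" | head -n 1 | cut -d: -f2
}

# Succeeds only when at least one TaskManager is registered and at least
# one task slot is free. Field names are assumptions; see the text above.
cluster_ready() {
  local base=${1:-http://127.0.0.1:8081} json tms free
  json=$(curl -fsS -m 5 "$base/overview") || return 1
  tms=$(printf '%s' "$json" | overview_field taskmanagers)
  free=$(printf '%s' "$json" | overview_field slots-available)
  [ "${tms:-0}" -ge 1 ] && [ "${free:-0}" -ge 1 ]
}
```

Run cluster_ready before submitting a job; a non-zero exit status means the REST API was unreachable, no TaskManager is registered, or every slot is in use.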

Step 6: The Flink Web Dashboard — Cluster Overview

Open a web browser and navigate to http://<public-ip>:8081/. The Web Dashboard opens on the cluster Overview, which shows the available task slots, the running job count, the list of running jobs and the list of completed jobs, with the Flink version and commit in the header.

Figure: Apache Flink Web Dashboard cluster overview

Because Apache Flink has no authentication, anyone who can reach port 8081 has full control of the cluster, including the ability to submit and cancel jobs. Keep the port 8081 security group rule restricted to trusted networks, and read Step 12 before exposing the dashboard more widely.

Step 7: Task Managers

The Task Managers page in the left navigation lists every registered TaskManager with its path and ID, data port, last heartbeat, total and free task slots, CPU cores and memory. On the cloudimg image you start with one TaskManager exposing two task slots.

Figure: Apache Flink Web Dashboard Task Managers page

Selecting a TaskManager opens its detail view with per TaskManager metrics, logs, and thread dump. You scale the cluster out by adding more TaskManager processes on additional instances and pointing them at this JobManager — see Step 11.

Step 8: Submit a Job from the Command Line

The Flink command line client at /opt/flink/bin/flink submits, lists and cancels jobs. The Flink distribution ships a set of example jobs under /opt/flink/examples. The streaming TopSpeedWindowing example is a continuous job that is useful for confirming the cluster end to end. Submit it in detached mode so the client returns immediately:

cd /tmp && sudo -u flink /opt/flink/bin/flink run -d /opt/flink/examples/streaming/TopSpeedWindowing.jar

The client prints the assigned JobID. List the running jobs to confirm it was accepted:

cd /tmp && sudo -u flink /opt/flink/bin/flink list
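When a script needs the JobID for later steps, it can capture it at submission time. The sketch below assumes the JobID is the first 32-character hexadecimal string in the client output, the same pattern Step 10 uses; the function names are illustrative.

```shell
# Sketch: pull the first 32-character hexadecimal JobID out of client output.
extract_job_id() {
  grep -oE '[0-9a-f]{32}' | head -n 1
}

# Submit a JAR in detached mode and print only the JobID.
submit_detached() {
  local jar=$1 out
  out=$(cd /tmp && sudo -u flink /opt/flink/bin/flink run -d "$jar" 2>&1) || {
    # On failure, show the client output so the cause is visible.
    printf '%s\n' "$out" >&2
    return 1
  }
  printf '%s\n' "$out" | extract_job_id
}
```

For example, JOB_ID=$(submit_detached /opt/flink/examples/streaming/TopSpeedWindowing.jar) stores the ID for a later flink cancel or flink savepoint.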

Step 9: Watch a Running Job in the Dashboard

Back in the Web Dashboard, select Jobs then Running Jobs and open the job you submitted. The job view shows the job graph — the streaming dataflow as a directed graph of operators — along with the job state, the start time and duration, and a per subtask metrics table with records and bytes flowing through each operator.

Figure: A running Apache Flink job shown in the Web Dashboard

The Submit New Job page in the left navigation does the same thing as the command line client through the browser: upload a job JAR, set the entry class, the program arguments and the parallelism, and submit. Job submission and cancellation through the dashboard are both enabled on this image.
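The job state that the dashboard displays is also available from the REST API, which is convenient for scripts. The sketch below assumes that the /jobs/<job-id> endpoint returns a top-level state field with values such as RUNNING or CANCELED; check the response on your own cluster before relying on it. The function name job_state is illustrative.

```shell
# Sketch: print the state of one job, for example RUNNING or CANCELED.
# Assumes the first "state" field in the response is the job's own state.
job_state() {
  local jid=$1 base=${2:-http://127.0.0.1:8081}
  curl -fsS -m 5 "$base/jobs/$jid" \
    | grep -oE '"state":"[A-Z_]+"' | head -n 1 | cut -d'"' -f4
}
```

Call it as job_state <job-id>. An empty result means the job is unknown to the JobManager or the REST API could not be reached.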

Step 10: Cancel a Job

Cancel a running job from the command line by passing its JobID to flink cancel. The example below cancels every job currently running, so use it only when you intend to stop all of them. Replace the loop with a single flink cancel <job-id> to stop one specific job.

cd /tmp
for jid in $(sudo -u flink /opt/flink/bin/flink list -r 2>/dev/null | grep -oE '[0-9a-f]{32}'); do
  sudo -u flink /opt/flink/bin/flink cancel "$jid"
done

A job can also be cancelled from its job view in the Web Dashboard with the Cancel Job control in the top right.
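A third option is the REST API, which accepts a PATCH request on /jobs/<job-id> with mode=cancel. The sketch below wraps that call and first rejects anything that is not a 32-character hexadecimal JobID; the function name cancel_job_rest is illustrative.

```shell
# Sketch: cancel one job through the REST API instead of the CLI.
cancel_job_rest() {
  local jid=$1 base=${2:-http://127.0.0.1:8081}
  # Refuse anything that is not a 32-character hexadecimal JobID.
  case $jid in
    *[!0-9a-f]*|"") echo "invalid JobID: $jid" >&2; return 2 ;;
  esac
  [ "${#jid}" -eq 32 ] || { echo "invalid JobID: $jid" >&2; return 2; }
  curl -fsS -m 5 -X PATCH "$base/jobs/$jid?mode=cancel"
}
```

Because the request is unauthenticated, the same call works for anyone who can reach port 8081, which is why the port must stay restricted.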

Step 11: Scale Out and Tune the Cluster

The cluster configuration is /opt/flink/conf/config.yaml. After editing it, restart the services so the change takes effect:

sudo systemctl restart flink-jobmanager.service flink-taskmanager.service

The defaults on this image are sized for a modest instance:

  • jobmanager.memory.process.size: 1600m
  • taskmanager.memory.process.size: 2048m
  • taskmanager.numberOfTaskSlots: 2
  • parallelism.default: 1

For a larger instance type, raise taskmanager.memory.process.size and taskmanager.numberOfTaskSlots to match the available memory and vCPUs. To scale beyond one instance, launch further instances, install Flink, and start additional TaskManager processes pointed at this JobManager's RPC address — the cluster then has more task slots and can run jobs at higher parallelism. For production workloads, run the JobManager in a highly available configuration with multiple JobManagers and a durable metadata store, as described in the official Flink documentation.
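As a starting point for sizing, the arithmetic can be sketched as below. This is a rule of thumb, not a recommendation from the image: it assumes the JobManager keeps its 1600 MiB default, holds back 1024 MiB for the operating system, and allocates one task slot per vCPU. The function name and the reserve figure are illustrative, so adjust both to your workload.

```shell
# Sketch: suggest TaskManager settings from instance memory (MiB) and vCPUs.
# Assumptions: 1600 MiB stays with the JobManager (the image default),
# 1024 MiB is held back for the operating system, one task slot per vCPU.
suggest_tm_settings() {
  local total_mb=$1 vcpus=$2
  local jm_mb=1600 os_reserve_mb=1024
  local tm_mb=$(( total_mb - jm_mb - os_reserve_mb ))
  if [ "$tm_mb" -lt 1024 ]; then
    echo "not enough memory for a TaskManager" >&2
    return 1
  fi
  printf 'taskmanager.memory.process.size: %sm\n' "$tm_mb"
  printf 'taskmanager.numberOfTaskSlots: %s\n' "$vcpus"
}
```

For an instance with 16384 MiB and 4 vCPUs, suggest_tm_settings 16384 4 prints taskmanager.memory.process.size: 13760m and taskmanager.numberOfTaskSlots: 4. Pass the memory the instance actually reports rather than the nominal figure, then copy the values into /opt/flink/conf/config.yaml and restart the services.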

Step 12: Secure the Dashboard Port

Apache Flink does not authenticate Web Dashboard or REST API requests, so the network is your security boundary. Take at least one of these measures before the instance is reachable from any untrusted network:

  • Keep the port 8081 inbound rule in the instance's security group restricted to specific trusted CIDR ranges, never 0.0.0.0/0
  • Place the dashboard behind an Application Load Balancer that terminates TLS and is itself protected, for example by AWS WAF or by an authentication layer
  • Reach the dashboard over an SSH tunnel and leave port 8081 closed to the internet entirely:
ssh -L 8081:127.0.0.1:8081 <login-user>@<public-ip>

With the tunnel open, browse to http://127.0.0.1:8081/ on your workstation.
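If you manage the security group from the command line, a small guard can keep an open-world rule from being added by accident. The sketch below wraps aws ec2 authorize-security-group-ingress; the function name allow_dashboard_from is illustrative, and the security group ID and CIDR range are values you supply.

```shell
# Sketch: open port 8081 to one trusted CIDR range, refusing open-world ranges.
allow_dashboard_from() {
  local group_id=$1 cidr=$2
  case $cidr in
    0.0.0.0/0|::/0|"")
      echo "refusing to open port 8081 to '$cidr'" >&2
      return 2 ;;
  esac
  aws ec2 authorize-security-group-ingress \
    --group-id "$group_id" \
    --protocol tcp --port 8081 \
    --cidr "$cidr"
}
```

Call it as allow_dashboard_from <security-group-id> <trusted-cidr>, run from a workstation with IAM permission to modify the group.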

Step 13: Checkpoints and Savepoints

Flink's fault tolerance is built on two kinds of snapshot: checkpoints, which Flink takes periodically to capture running job state, and savepoints, which are durable snapshots that you trigger explicitly for upgrades and migrations. On this image both directories live on the dedicated data volume, under /opt/flink-data/checkpoints and /opt/flink-data/savepoints.

The example below picks the first running job reported by the command line client and triggers a savepoint for it, or prints a message when nothing is running. It uses flink list -r rather than the /jobs/overview endpoint because that endpoint also returns finished and cancelled jobs, which cannot be snapshotted:

JOB_ID=$(cd /tmp && sudo -u flink /opt/flink/bin/flink list -r 2>/dev/null | grep -oE '[0-9a-f]{32}' | head -1)
if [ -n "$JOB_ID" ]; then cd /tmp && sudo -u flink /opt/flink/bin/flink savepoint "$JOB_ID"; else echo "no running jobs to snapshot"; fi
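To resume a job from a savepoint you need the savepoint path. The sketch below finds the most recently modified savepoint directory, assuming Flink's usual naming in which each savepoint is a directory whose name begins with savepoint-. The function name latest_savepoint is illustrative.

```shell
# Sketch: print the most recently modified savepoint directory.
# Assumes each savepoint is a directory named savepoint-<suffix>.
latest_savepoint() {
  local dir=${1:-/opt/flink-data/savepoints} d latest=""
  for d in "$dir"/savepoint-*; do
    [ -d "$d" ] || continue
    if [ -z "$latest" ] || [ "$d" -nt "$latest" ]; then
      latest=$d
    fi
  done
  # Return non-zero when no savepoint directory was found.
  [ -n "$latest" ] && printf '%s\n' "$latest"
}
```

The result can be passed to the client's -s option, for example sudo -u flink /opt/flink/bin/flink run -d -s "$(latest_savepoint)" <job-jar>, where <job-jar> is the JAR of the job that wrote the savepoint. Reading the directory may require running the function as root or as the flink user.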

Step 14: Backups and Maintenance

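The state that matters on a Flink instance is the savepoints, plus any job JARs and configuration you have added. The cluster configuration at /opt/flink/conf/config.yaml sits outside the data volume, so copy it separately if you have edited it. Archive the data volume directory and copy the archive to durable off instance storage: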

sudo tar -czf /var/backups/flink-data-$(date +%F).tgz -C /opt flink-data

Ship the archive to an Amazon S3 bucket or another object store, and synchronise the savepoints directory on a schedule for off instance retention.
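The upload can be sketched with the AWS CLI as below. This assumes the AWS CLI is installed and has write access to the bucket wherever you run it, for example through an instance profile. The function name ship_backup, the bucket name you pass in, and the archives/ and savepoints/ key prefixes are all illustrative.

```shell
# Sketch: copy one archive and mirror the savepoints directory to S3.
# The bucket name is supplied by the caller; the key prefixes are illustrative.
ship_backup() {
  local archive=$1 bucket=$2
  aws s3 cp "$archive" "s3://$bucket/archives/" || return 1
  aws s3 sync /opt/flink-data/savepoints "s3://$bucket/savepoints/"
}
```

Call it as ship_backup /var/backups/flink-data-<date>.tgz <bucket-name>, and schedule it with cron or a systemd timer for regular off instance retention.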

For kernel and package updates, Ubuntu's unattended-upgrades is enabled by default, so operating system security patches apply automatically. To move to a newer Flink release, download the new binary distribution from https://flink.apache.org/, take a savepoint of each running job, stop the services, swap the /opt/flink symlink to the new versioned directory, carry your configuration across, restart the services, and resume each job from its savepoint.
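The symlink swap in that sequence can be sketched as follows. It assumes the new release has already been unpacked next to the current one as /opt/flink-<version>, matching the layout described in the Overview; the function name switch_flink_version is illustrative, and the optional second argument exists only so that the function can be tried against a scratch directory.

```shell
# Sketch: repoint the flink symlink at another versioned directory.
# $1 is the target version string, $2 the parent directory (default /opt).
switch_flink_version() {
  local version=$1 base=${2:-/opt}
  local target="$base/flink-$version"
  [ -d "$target" ] || { echo "missing $target" >&2; return 1; }
  # -n replaces the link itself instead of descending into the old target.
  ln -sfn "$target" "$base/flink"
}
```

Run it as root with both services stopped, and only after the savepoints have been taken. Copy your config.yaml into the new directory before restarting.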

Step 15: Logs and Troubleshooting

The JobManager and the TaskManager log to the systemd journal and to log files on the data volume:

sudo journalctl -u flink-jobmanager.service --no-pager -n 100
sudo journalctl -u flink-taskmanager.service --no-pager -n 100
sudo ls -1 /opt/flink-data/log/

If a job fails, its exceptions are shown on the Exceptions tab of the job view in the Web Dashboard. If the TaskManager does not appear on the Task Managers page, confirm flink-taskmanager.service is active and check its journal for connection errors to the JobManager.
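A quick way to see which log file deserves attention is to count error lines per file. The sketch below assumes the log files end in .log and that error entries contain the level ERROR surrounded by spaces, as in Flink's default log pattern; adjust both if your logging configuration differs. The function name log_error_counts is illustrative.

```shell
# Sketch: print "<count> <file>" for each log file containing ERROR lines.
# Assumes files end in .log and use the level " ERROR " in each entry.
log_error_counts() {
  local dir=${1:-/opt/flink-data/log} f n
  for f in "$dir"/*.log; do
    [ -f "$f" ] || continue
    n=$(grep -c ' ERROR ' "$f")
    [ "$n" -gt 0 ] && printf '%s %s\n' "$n" "$f"
  done
  return 0
}
```

Run it as root or as the flink user if the log directory is not readable by your login user, then open the file with the highest count.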

Support

cloudimg provides 24/7/365 expert technical support for this image. A response is guaranteed within 24 hours, with a one hour average for critical issues. Contact support@cloudimg.co.uk.

For general Apache Flink questions consult the official documentation at https://flink.apache.org/ and the project community resources linked from there.