
Apache Spark on AWS User Guide

Last updated: 2026-05-22 | Product: Apache Spark on AWS

Overview

This image runs Apache Spark 4.1.2 (built for Hadoop 3) as a single-node standalone cluster. A Spark master and one Spark worker are preconfigured and supervised by systemd, each running under a dedicated unprivileged spark system user. PySpark, Spark SQL, and the Scala-based Spark Shell are ready to use. Java 17 and Python 3 are bundled. You launch the instance, open the master web UI, and run your first spark-submit job in minutes.

The image is intended for teams that want a working analytics engine on day one, without spending hours reconciling Java, Python, and Spark versions or wiring up systemd units. It is not a multi-node production cluster, it does not ship with Spark authentication enabled out of the box, and it does not run a Spark History Server. Step 13 documents the recommended path for enabling authentication before you put real production traffic through the cluster, and Step 14 covers enabling the Spark History Server after launch.

Apache Spark standalone has no built-in authentication and ships no shared or default credentials. There is nothing to retrieve and no password baked into the image. On first boot a one-shot service writes a short, non-secret information file at /stage/scripts/spark-info.log describing the cluster, and then marks itself complete.


Prerequisites

Before you deploy this image you need:

An Amazon Web Services account where you can launch EC2 instances
IAM permissions to launch instances, create security groups, and subscribe to AWS Marketplace products
An EC2 key pair in the target Region for SSH access to the instance
A VPC and subnet in the target Region, with a security group allowing inbound port 22 from your management network and inbound port 8080 from the small set of trusted IP addresses that will view the Spark master web UI
The AWS CLI (version 2) installed locally if you plan to deploy from the command line
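
If the security group does not exist yet, the two inbound rules described above could be added from the command line roughly as follows. This is a minimal sketch: open_spark_ports is a hypothetical helper name, the CIDR arguments are placeholders for your own ranges, and the function is only defined here, so nothing changes until you call it.

```shell
# Hypothetical helper: open SSH (22) and the Spark master web UI (8080)
# on an existing security group. Arguments:
#   $1 = security group ID, $2 = management CIDR, $3 = web UI viewer CIDR
open_spark_ports() {
  local sg_id="$1" mgmt_cidr="$2" ui_cidr="$3"
  aws ec2 authorize-security-group-ingress \
    --group-id "$sg_id" --protocol tcp --port 22 --cidr "$mgmt_cidr"
  aws ec2 authorize-security-group-ingress \
    --group-id "$sg_id" --protocol tcp --port 8080 --cidr "$ui_cidr"
}

# Example call (addresses are documentation-only placeholders):
# open_spark_ports <security-group-id> 198.51.100.0/24 203.0.113.7/32
```

Port 7077 is deliberately left out, in line with the rest of this guide.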

Step 1: Launch the Instance from the AWS Marketplace

Sign in to the AWS Management Console, open the EC2 service, and select Launch instance. Under Application and OS Images choose AWS Marketplace AMIs and search for Apache Spark. Select the cloudimg listing and choose Select, then Continue on the subscription summary.

Pick an instance type of m5.xlarge or larger. Spark executor memory pressure dominates any realistic workload, so 4 vCPU and 16 GiB of RAM is a practical minimum. Smaller types work for evaluation. Choose your EC2 key pair under Key pair (login). Under Network settings select your VPC and subnet, and either create or select a security group that allows inbound port 22 from your management network and inbound port 8080 from the trusted IP addresses that will view the Spark master web UI. Leave the root volume at the default size or larger.

Select Launch instance. First boot initialisation takes a few seconds after the instance state becomes Running and the status checks pass. The Spark master and worker start automatically.

Step 2: Launch the Instance from the AWS CLI

The following block launches an instance from the cloudimg Apache Spark Marketplace AMI into an existing subnet and security group. Replace <ami-id> with the AMI ID shown on the Marketplace listing, <key-name> with your EC2 key pair name, <subnet-id> with your subnet ID, and <security-group-id> with a security group that opens ports 22 and 8080 as described above.

aws ec2 run-instances \
  --image-id <ami-id> \
  --instance-type m5.xlarge \
  --key-name <key-name> \
  --subnet-id <subnet-id> \
  --security-group-ids <security-group-id> \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":30,"VolumeType":"gp3"}}]' \
  --metadata-options 'HttpTokens=required' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=apache-spark-01}]'

The command prints a JSON document on success. Note the instance ID, then retrieve its public address once it is running with aws ec2 describe-instances --instance-ids <instance-id> --query "Reservations[].Instances[].PublicIpAddress" --output text.
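
The wait and the address lookup can be combined into one step. The sketch below assumes the AWS CLI is configured for the target Region; wait_for_public_ip is a hypothetical helper name, not part of the image or the AWS CLI.

```shell
# Hypothetical helper: block until the instance passes its status checks,
# then print its public IPv4 address.
wait_for_public_ip() {
  local instance_id="$1"
  aws ec2 wait instance-status-ok --instance-ids "$instance_id" || return 1
  aws ec2 describe-instances --instance-ids "$instance_id" \
    --query 'Reservations[].Instances[].PublicIpAddress' --output text
}

# Example:
# PUBLIC_IP="$(wait_for_public_ip <instance-id>)"
```

The instance only has a public address if the subnet assigns one, so an empty result usually points at the subnet settings rather than at the image.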

The Spark master RPC port 7077 is intentionally not in any security group rule. You reach the cluster via spark-submit executed on the instance itself, not across the network. Add a 7077 inbound rule only if you are deliberately running Spark drivers on another host and you have first enabled Spark authentication per Step 13.

Step 3: Connect via SSH

Connect over SSH with the key pair you selected and the public IP address of the instance. The SSH login user depends on the operating system of the AMI variant you launched:

AMI variant	SSH login user
Apache Spark 4 on Ubuntu 24.04	`ubuntu`

ssh <login-user>@<public-ip>

The first login may take a few seconds while cloud-init finalises. Once you have a shell, the Spark master and worker services have already been started by systemd. A short, non-secret information file describing the cluster is available at /stage/scripts/spark-info.log:

cat /stage/scripts/spark-info.log

It records the Spark version, the master URL, and the master web UI address. There are no credentials in this file because Apache Spark standalone ships with authentication disabled.

Step 4: Switch to the spark user and Source the Environment

The master and worker daemons run as the dedicated spark system user. For interactive work (spark-shell, spark-submit, pyspark) switch into that account and source the convenience shim:

sudo -iu spark
source ~/setEnv.sh
echo "$SPARK_HOME"

Expected output:

/opt/spark

The shim exports SPARK_HOME, JAVA_HOME, PYSPARK_PYTHON, and prepends $SPARK_HOME/bin and $SPARK_HOME/sbin to your PATH. Every subsequent step in this guide assumes you have sourced setEnv.sh in your current shell.
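
To confirm that the current shell really has the environment described above, a small check along the following lines could help. It is only a sketch: check_spark_env is a hypothetical helper, and it tests nothing beyond the variables and PATH entries named in this step.

```shell
# Hypothetical helper: report anything the shim is expected to provide
# that is missing from the current shell. Returns non-zero on any gap.
check_spark_env() {
  local status=0 name value dir
  for name in SPARK_HOME JAVA_HOME PYSPARK_PYTHON; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || { echo "unset: $name"; status=1; }
  done
  if [ -n "${SPARK_HOME:-}" ]; then
    for dir in "$SPARK_HOME/bin" "$SPARK_HOME/sbin"; do
      case ":$PATH:" in
        *":$dir:"*) ;;
        *) echo "not on PATH: $dir"; status=1 ;;
      esac
    done
  fi
  return "$status"
}

# Example:
# check_spark_env || source ~/setEnv.sh
```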

Step 5: Confirm the Standalone Cluster Is Running

The image ships with both master and worker enabled under systemd. After a fresh launch they are already up. Verify:

sudo systemctl status spark-master.service
sudo systemctl status spark-worker.service
ss -tlnp | grep -E ':8080|:7077'

Expected output (abridged):

Active: active (running)
...
LISTEN 0 1    *:8080                  *:*
LISTEN 0 4096 [::ffff:127.0.0.1]:7077 *:*

The Spark master RPC port 7077 is intentionally bound to loopback only so the image without authentication cannot be reached over the network. The master web UI on 8080 is bound to all interfaces; restrict it to trusted IPs via the security group.
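
The loopback binding can be checked mechanically, for example after a configuration change. The sketch below assumes the column layout of ss -tln shown above (local address in the fourth field); rpc_port_exposed is a hypothetical helper name.

```shell
# Hypothetical helper: read `ss -tln` output on stdin and print every
# listener on port 7077 that is NOT bound to a loopback address.
# Exit status is 0 if at least one exposed listener was found, 1 otherwise.
rpc_port_exposed() {
  awk '
    $1 == "LISTEN" && $4 ~ /:7077$/ {
      addr = $4
      sub(/:7077$/, "", addr)
      if (addr !~ /^(127\.|\[::1\]$|\[::ffff:127\.)/) { print $4; found = 1 }
    }
    END { exit (found ? 0 : 1) }
  '
}

# Example:
# if ss -tln | rpc_port_exposed; then echo "WARNING: 7077 is reachable off-host"; fi
```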

If either service is inactive, start them both with the supplied helper script in the spark user's home directory:

~/start_spark.sh

To stop the cluster cleanly use ~/stop_spark.sh (which stops the worker first, then the master). Both helpers call systemctl and are run from the spark user's shell you entered in Step 4.

Step 6: Open the Spark Master Web UI

The Spark master web UI is served on port 8080. From a browser on one of the trusted IP addresses you added to the security group, open:

http://<public-ip>:8080

You see a page titled Spark Master at spark://localhost:7077 with one worker listed under Workers in the ALIVE state. The page also lists every running and completed application.

Figure: Spark master web UI cluster overview

The header gives the number of workers, the total and used cores, and the total and used memory. The Workers table shows each worker, its address, state, cores, and memory. The Running Applications and Completed Applications tables list every application the cluster has accepted. If the page loads but no workers are listed, see Step 15.

Step 7: Inspect the Worker

Select the worker ID link in the Workers table to open the worker's own web UI, served on port 8081. The worker page shows the resources the worker contributes to the cluster and the executors it is currently running.

Figure: Spark worker detail page

The Running Executors table lists the executors the worker has launched for active applications, each with its core and memory allocation and links to the executor stdout and stderr logs. The Finished Executors table records executors from applications that have completed. Like the per-application UI on port 4040 (see Step 8), port 8081 is not opened by the security group by default; reach the worker UI through the master UI link from a trusted host, or add an 8081 inbound rule scoped to your management IP addresses.
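
An alternative that avoids opening extra ports is SSH local port forwarding, which this guide does not otherwise cover. The sketch below only prints the command so you can review it first; tunnel_command is a hypothetical helper, the address is a documentation placeholder, and it assumes the UIs accept connections on the instance's loopback interface.

```shell
# Hypothetical helper: print (not run) an ssh command that forwards each
# given UI port on your workstation to the same port on the instance.
tunnel_command() {
  local host="$1"; shift
  local cmd="ssh -N" port
  for port in "$@"; do
    cmd="$cmd -L $port:localhost:$port"
  done
  echo "$cmd ubuntu@$host"
}

tunnel_command 203.0.113.10 8081 4040
# prints: ssh -N -L 8081:localhost:8081 -L 4040:localhost:4040 ubuntu@203.0.113.10
```

With the tunnel running, the worker UI would be at http://localhost:8081 on your workstation.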

Step 8: Run Your First Job with spark-submit

The Spark distribution ships a built-in SparkPi example that estimates pi using a Monte Carlo simulation across executors. With the environment from Step 4 sourced, spark-submit is on your PATH. Run SparkPi against the standalone cluster from the spark user's shell:

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://localhost:7077 \
  --conf spark.driver.host=localhost \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 10

Expected output includes a line of the form:

Pi is roughly 3.141...

Surrounding INFO log lines are normal Spark output and can be ignored. The accuracy of the estimate depends on the number of partitions passed as the final argument (10 in this example), because more partitions means more random samples.
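
To see why the sample count matters, the same Monte Carlo idea can be tried on a single core without Spark. This is an illustrative sketch in awk, not the SparkPi source: it draws random points in the square from -1 to 1 on each axis and counts the fraction that land inside the unit circle, which approaches pi/4. The function name and the default seed are arbitrary.

```shell
# Illustrative only: estimate pi from N random points.
#   $1 = number of samples, $2 = optional random seed (default 42)
estimate_pi() {
  awk -v n="$1" -v seed="${2:-42}" 'BEGIN {
    srand(seed)
    for (i = 0; i < n; i++) {
      x = 2 * rand() - 1
      y = 2 * rand() - 1
      if (x * x + y * y <= 1) inside++
    }
    printf "%.4f\n", 4 * inside / n
  }'
}

estimate_pi 1000      # coarse estimate
estimate_pi 1000000   # typically much closer to 3.1416
```

SparkPi spreads the same loop over the partitions you request, so a larger final argument gives it more samples to average over.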

While a job is running, the Spark master web UI lists it under Running Applications, and the per-application web UI on port 4040 shows the job, its stages, and its executors.

Figure: Spark application detail page in the web UI

The application UI breaks the run down into jobs and stages, with a task progress bar for each. The Stages, Storage, Environment, and Executors tabs across the top give the full execution detail. The application UI is served only while a driver is active and tears down when the job exits.

Step 9: Run a PySpark Job

From the spark user's shell, start an interactive PySpark session against the cluster:

pyspark --master spark://localhost:7077 --conf spark.driver.host=localhost

Once the REPL is ready:

df = spark.range(1, 1000001)
print("sum:", df.selectExpr("sum(id) as s").collect()[0]["s"])

Expected output:

sum: 500000500000

Exit the shell with exit() or Ctrl+D.
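
The same calculation can also run non-interactively by saving it as a script and handing it to spark-submit. The sketch below is one way to do that; write_sum_job and the file path are hypothetical, and unlike the REPL a script has to create its own SparkSession.

```shell
# Hypothetical helper: write a small PySpark script to the path given in $1.
write_sum_job() {
  cat > "$1" <<'PY'
from pyspark.sql import SparkSession

# A script must build its own session; the pyspark REPL does this for you.
spark = SparkSession.builder.appName("sum-range").getOrCreate()
df = spark.range(1, 1000001)
print("sum:", df.selectExpr("sum(id) as s").collect()[0]["s"])
spark.stop()
PY
}

# Example, run as the spark user after sourcing setEnv.sh:
# write_sum_job /tmp/sum_range.py
# spark-submit --master spark://localhost:7077 \
#   --conf spark.driver.host=localhost /tmp/sum_range.py
```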

Step 10: Run a Spark SQL Query

Spark SQL is available both from spark-shell and through spark-sql directly. From the spark user's shell, start the spark-sql CLI:

spark-sql --master spark://localhost:7077 --conf spark.driver.host=localhost

Once the prompt is ready:

CREATE TABLE cloudimg_counter (n INT) USING PARQUET;
INSERT INTO cloudimg_counter VALUES (1), (2), (3), (4), (5);
SELECT SUM(n) AS total FROM cloudimg_counter;

Expected output of the final query:

15

Drop the table when you are done with DROP TABLE cloudimg_counter; at the same prompt. Spark SQL stores table data under the warehouse directory /opt/spark-data/warehouse, which is on the dedicated data volume described in Step 12.
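
For scripted use, spark-sql also accepts a statement on the command line through its -e option, which avoids the interactive prompt. The wrapper below is a sketch; run_sql is a hypothetical name and it simply reuses the master URL and driver host setting from this step.

```shell
# Hypothetical helper: run one SQL statement against the standalone
# cluster without opening the interactive prompt.
run_sql() {
  spark-sql --master spark://localhost:7077 \
    --conf spark.driver.host=localhost -e "$1"
}

# Example, run as the spark user after sourcing setEnv.sh:
# run_sql 'SELECT SUM(n) AS total FROM cloudimg_counter;'
```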

Step 11: Start, Stop, and Check Status

The master and worker services are started by systemd at boot. Manage them as follows:

# Status
sudo systemctl status spark-master.service
sudo systemctl status spark-worker.service

# Stop (worker first, then master)
sudo systemctl stop spark-worker.service
sudo systemctl stop spark-master.service

# Start (master first, then worker)
sudo systemctl start spark-master.service
sudo systemctl start spark-worker.service

# Tail live logs
sudo journalctl -u spark-master.service -f
sudo journalctl -u spark-worker.service -f

On-disk log files for the daemons are written to /opt/spark-data/logs. The spark-worker.service unit declares Requires=spark-master.service, so stopping the master automatically stops the worker too.
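
The dependency can be read back from systemd rather than taken on trust. The sketch below uses systemctl show with its --value flag; worker_requires_master is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the worker unit lists the master
# unit among its Requires= dependencies.
worker_requires_master() {
  case " $(systemctl show -p Requires --value spark-worker.service) " in
    *" spark-master.service "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
# worker_requires_master && echo "stopping the master will also stop the worker"
```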

Step 12: The Dedicated Data Volume

The image attaches a separate, independently resizable EBS volume mounted at /opt/spark-data. Keeping Spark's working data off the operating system disk means you can grow the data volume on its own and snapshot it independently. The volume holds three directories:

Path	Purpose
`/opt/spark-data/work`	Spark worker work directory: application jars, and per-executor `stdout` and `stderr`
`/opt/spark-data/warehouse`	Spark SQL warehouse: data files for tables created with Spark SQL
`/opt/spark-data/logs`	Spark master and worker daemon log files

The volume is recorded in /etc/fstab by its filesystem UUID with the nofail option, so it mounts automatically on every boot and the instance still boots if the volume is ever detached. To grow it, modify the EBS volume in the AWS console or with aws ec2 modify-volume, then run sudo resize2fs /dev/nvme1n1 on the instance. Confirm the device name with lsblk first, because NVMe device numbering can differ between instances.
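
The two halves of the resize can be sketched as a pair of helpers. Both function names are hypothetical. The second one looks up the device that backs /opt/spark-data with findmnt instead of assuming a fixed device name.

```shell
# Run from your workstation: request a new size (in GiB) for the data volume.
#   $1 = EBS volume ID, $2 = new size in GiB
request_volume_size() {
  aws ec2 modify-volume --volume-id "$1" --size "$2"
}

# Run on the instance once the modification has taken effect:
# grow the filesystem on whichever device backs /opt/spark-data.
grow_data_filesystem() {
  local device
  device="$(findmnt -n -o SOURCE /opt/spark-data)" || return 1
  sudo resize2fs "$device"
}

# Example:
# request_volume_size <volume-id> 100
# grow_data_filesystem
```

EBS volumes can only be grown, not shrunk, so choose the new size with that in mind.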

For shuffle-heavy workloads you can also point Spark's local scratch directory at the data volume:

sudo systemctl stop spark-worker.service
sudo systemctl stop spark-master.service
sudo mkdir -p /opt/spark-data/local
sudo chown spark:spark /opt/spark-data/local
echo 'spark.local.dir                  /opt/spark-data/local' | sudo tee -a /opt/spark/conf/spark-defaults.conf
sudo systemctl start spark-master.service
sudo systemctl start spark-worker.service

spark.local.dir is the scratch directory where Spark writes shuffle files and any RDD partitions that are stored on, or spilled to, disk.

Step 13: Enable Spark Authentication Before Production

The shipped image has Spark authentication disabled. Any client that can reach port 7077 can submit arbitrary jobs. Do not expose the master RPC port beyond the instance itself until you have completed this step.

Spark standalone authentication uses a shared secret that every cluster participant must present:

# Generate a shared secret
SECRET="$(openssl rand -hex 32)"

# Append to spark-defaults.conf
sudo tee -a /opt/spark/conf/spark-defaults.conf >/dev/null <<EOF
spark.authenticate               true
spark.authenticate.secret        ${SECRET}
EOF
sudo chown spark:spark /opt/spark/conf/spark-defaults.conf
sudo chmod 0640 /opt/spark/conf/spark-defaults.conf

# Restart
sudo systemctl restart spark-master.service
sudo systemctl restart spark-worker.service

Every spark-submit, pyspark, spark-shell, and spark-sql invocation must then present the same settings. Invocations run on the instance as the spark user read them from /opt/spark/conf/spark-defaults.conf; a driver that does not read that file must pass --conf spark.authenticate=true --conf spark.authenticate.secret=<your-secret> explicitly. Store the secret somewhere safe. Any job that does not present it is rejected at handshake time. The Apache Spark security documentation at https://spark.apache.org/docs/latest/security.html is the authoritative reference for the full set of options including TLS encryption, event log protection, and RPC SASL.
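
Passing the secret with --conf puts it on the command line, where other local users may be able to see it in the process list. One way around that is a private properties file handed to spark-submit through its --properties-file option. The sketch below is an assumption about how you might organise this, not something the image ships; write_auth_properties and the file name are hypothetical.

```shell
# Hypothetical helper: write the two authentication settings to a file
# that only the current user can read.
#   $1 = destination file, $2 = shared secret
write_auth_properties() {
  local file="$1" secret="$2"
  (
    umask 077
    printf 'spark.authenticate         true\nspark.authenticate.secret  %s\n' \
      "$secret" > "$file"
  )
}

# Example:
# write_auth_properties ~/spark-auth.conf "<your-secret>"
# spark-submit --properties-file ~/spark-auth.conf \
#   --master spark://localhost:7077 ...
```

By default, spark-submit reads the file given with --properties-file in place of conf/spark-defaults.conf, so any other defaults you rely on need to be copied into it.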

Step 14: Optional, Enable the Spark History Server

The shipped image does not run the Spark History Server (port 18080) by default to keep the attack surface small. To enable it after launch:

# Create an event log directory on the data volume
sudo mkdir -p /opt/spark-data/events
sudo chown spark:spark /opt/spark-data/events

# Turn on event logging
sudo tee -a /opt/spark/conf/spark-defaults.conf >/dev/null <<'EOF'
spark.eventLog.enabled           true
spark.eventLog.dir               file:/opt/spark-data/events
spark.history.fs.logDirectory    file:/opt/spark-data/events
EOF

# Start the history server as the spark user
sudo -u spark /opt/spark/sbin/start-history-server.sh

Then add a security group rule for TCP 18080 scoped to your trusted management IP addresses and visit http://<public-ip>:18080. To persist the history server across reboots, wrap start-history-server.sh in a systemd unit modelled on spark-master.service.
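
A unit file for that purpose might look like the sketch below. Every value in it is an assumption to check against the shipped spark-master.service, in particular how that unit supplies JAVA_HOME and whether it uses Type=forking. The unit name spark-history.service and the history_unit helper are hypothetical.

```shell
# Hypothetical helper: print a sketch of a history server unit file.
# Review it against spark-master.service before installing.
history_unit() {
  cat <<'UNIT'
[Unit]
Description=Apache Spark History Server
After=network.target

[Service]
# start-history-server.sh daemonises, hence Type=forking (assumption).
Type=forking
User=spark
Group=spark
ExecStart=/opt/spark/sbin/start-history-server.sh
ExecStop=/opt/spark/sbin/stop-history-server.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
UNIT
}

# Example install:
# history_unit | sudo tee /etc/systemd/system/spark-history.service >/dev/null
# sudo systemctl daemon-reload
# sudo systemctl enable --now spark-history.service
```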

Step 15: Troubleshooting

Master web UI on port 8080 does not load. Check sudo systemctl status spark-master.service. Then run ss -tlnp | grep :8080 on the instance; the master should be listening on *:8080. Finally, confirm that your security group allows your client IP on TCP 8080 inbound, and that your current public IP is the one in the allow rule.

Worker is not registering with the master. Look at the worker log under /opt/spark-data/logs/ or run sudo journalctl -u spark-worker.service -n 200 --no-pager. The usual cause is a mismatched master URL. The worker unit uses spark://localhost:7077, and the master must be running on the same host.

spark-submit fails with ClassNotFoundException for org.apache.spark.examples.SparkPi. The examples jar path expands with a glob, so confirm the actual jar is present:

ls /opt/spark/examples/jars/spark-examples*.jar

Use the full path printed above. If the file is missing, the install was interrupted and you should relaunch from the AMI rather than patching in place.
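
If you script the submission, resolving the glob to exactly one file up front gives a clearer failure than a ClassNotFoundException. The helper below is a sketch with a hypothetical name. It prints the jar path only when exactly one match exists.

```shell
# Hypothetical helper: print the single spark-examples jar found in the
# directory given as $1. Returns non-zero if none or several match.
find_examples_jar() {
  local dir="$1" count=0 jar match=""
  for jar in "$dir"/spark-examples*.jar; do
    [ -e "$jar" ] || continue   # skip the unexpanded pattern when nothing matches
    match="$jar"
    count=$((count + 1))
  done
  [ "$count" -eq 1 ] && echo "$match"
}

# Example:
# JAR="$(find_examples_jar /opt/spark/examples/jars)" || echo "examples jar not found"
```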

Java version mismatch. Spark 4.1 requires Java 17 or later. Run java -version; it should report 17.x. If it does not, run sudo update-alternatives --config java and select the java-17-openjdk entry.

Out of memory during spark-submit. The default driver memory is 1g and the default executor memory is 2g (see spark-defaults.conf). On m5.xlarge (16 GiB) you can raise spark.driver.memory to 4g and spark.executor.memory to 8g. For larger memory requirements move to m5.2xlarge or a memory-optimised family.

Step 16: Security Recommendations

Enable Spark authentication immediately after first boot if port 7077 will be reachable by anything other than the instance itself (Step 13)
Restrict the security group on TCP 8080 to the smallest possible source CIDR
Do not expose TCP 7077 beyond the instance in standalone single-node mode
Consider TLS encryption for RPC traffic per the Apache Spark security documentation when moving beyond single-node deployments
Rotate the shared secret used for Spark authentication on a regular schedule
Take regular EBS snapshots of the OS disk and the /opt/spark-data data volume
Keep the operating system patched. Ubuntu's unattended-upgrades is enabled by default, so security patches apply automatically
Subscribe to the Apache Spark announce mailing list and apply security patches as they are published

Support

cloudimg provides 24/7/365 expert technical support for this image, with a guaranteed response within 24 hours and a one-hour average response for critical issues. Contact support@cloudimg.co.uk.

Apache Spark is licensed under the Apache License 2.0. The full text is reproduced in /opt/spark/LICENSE on the deployed image. cloudimg distributes the unmodified upstream Apache Spark 4.1.2 (bin-hadoop3) binary tarball published by the Apache Software Foundation, with cloudimg-authored configuration, systemd units, and first-boot tooling layered on top. For Apache Spark questions outside the scope of cloudimg packaging, the Apache Spark community resources at https://spark.apache.org/community.html are the authoritative reference.