Apache Spark 4.1 on Ubuntu 22.04 on Azure User Guide

Product: Apache Spark 4.1 on Ubuntu 22.04 on Azure

Overview

This image runs Apache Spark 4.1.0 (built for Hadoop 3) as a single-node standalone cluster on Ubuntu 22.04 LTS. The Spark master and one Spark worker are pre-configured and managed by systemd. PySpark, Spark SQL, and the Scala-based Spark shell are ready to use. Java 17 and Python 3 are bundled. You can deploy the virtual machine, start the cluster, and run your first spark-submit job in under five minutes.

The image is intended for teams that want a working analytics engine on day one, without spending hours reconciling Java, Python, and Spark versions or wiring up systemd units. It is not a multi-node production cluster: it does not ship with Spark authentication enabled out of the box, and it does not include a Spark History Server. Section 14 documents the recommended path for enabling authentication before you put real production traffic through the cluster, and Section 15 covers enabling the Spark History Server after deployment.

The brand is lowercase cloudimg throughout this guide. All cloudimg URLs in this guide use the form https://www.cloudimg.co.uk.

Prerequisites

Before you deploy this image you need:

  • A Microsoft Azure subscription where you can create resource groups, virtual networks, and virtual machines
  • Azure role permissions equivalent to Contributor on the target resource group
  • An SSH public key for first login to the admin user account
  • A virtual network and subnet in the same region as the Azure Compute Gallery the image is published into, with an associated network security group
  • The Azure CLI (az version 2.50 or later) installed locally if you intend to use the CLI deployment path in Section 2
  • The cloudimg Apache Spark 4.1 offer enabled on your tenant in Azure Marketplace

Step 1: Deploy the Virtual Machine from the Azure Portal

Navigate to Marketplace in the Azure Portal, search for Apache Spark 4.1, and select the cloudimg publisher entry. Click Create to begin the wizard.

On the Basics tab choose your subscription, target resource group, and region. The region must match the region your Azure Compute Gallery exposes the image in. Set the virtual machine name. Choose SSH public key as the authentication type, set the username to a name of your choice, and paste your SSH public key. The recommended size is Standard_D4s_v3 (4 vCPU, 16 GiB RAM), which is a practical minimum for realistic Spark jobs. Smaller sizes work, but executor memory pressure will dominate any real workload.

On the Disks tab the recommended OS disk type is Premium SSD. Leave the OS disk size at the default. You can attach a separate Premium SSD data disk now if you intend to stage input data on the virtual machine, or add it later by following Section 13.

On the Networking tab select your existing virtual network and subnet. Attach a network security group that allows inbound TCP 22 from your management IP range, inbound TCP 8080 from the small set of trusted IP addresses that will view the Spark Master Web UI, and optionally inbound TCP 4040 from the same trusted IPs for the per-application Web UI. Do not expose TCP 7077 to the public internet: it is the standalone master RPC port, and the OSS image ships with Spark authentication disabled.

On the Management, Monitoring, and Advanced tabs the defaults are appropriate. Click Review + create, wait for validation to pass, then click Create. Deployment takes around three minutes.

Step 2: Deploy the Virtual Machine from the Azure CLI

If you prefer the command line, use the gallery image resource identifier as the source. The exact resource identifier is published on your Partner Center plan. A representative invocation:

RG="spark-prod"
LOCATION="eastus"
VM_NAME="spark-node-1"
ADMIN_USER="sparkops"
GALLERY_IMAGE_ID="/subscriptions/<sub-id>/resourceGroups/azure-cloudimg/providers/Microsoft.Compute/galleries/cloudimgGallery/images/apache-spark-4-1-ubuntu-22-04/versions/1.0.20260417"
SSH_KEY="$(cat ~/.ssh/id_rsa.pub)"
MGMT_CIDR="<your-mgmt-cidr>"

az group create --name "$RG" --location "$LOCATION"

az network vnet create \
  --resource-group "$RG" \
  --name spark-vnet \
  --address-prefix 10.30.0.0/16 \
  --subnet-name spark-subnet \
  --subnet-prefix 10.30.1.0/24

az network nsg create --resource-group "$RG" --name spark-nsg

az network nsg rule create \
  --resource-group "$RG" --nsg-name spark-nsg \
  --name allow-ssh-mgmt --priority 100 \
  --source-address-prefixes "$MGMT_CIDR" \
  --destination-port-ranges 22 --access Allow --protocol Tcp

az network nsg rule create \
  --resource-group "$RG" --nsg-name spark-nsg \
  --name allow-master-ui --priority 110 \
  --source-address-prefixes "$MGMT_CIDR" \
  --destination-port-ranges 8080 --access Allow --protocol Tcp

az network nsg rule create \
  --resource-group "$RG" --nsg-name spark-nsg \
  --name allow-app-ui --priority 120 \
  --source-address-prefixes "$MGMT_CIDR" \
  --destination-port-ranges 4040 --access Allow --protocol Tcp

az vm create \
  --resource-group "$RG" \
  --name "$VM_NAME" \
  --image "$GALLERY_IMAGE_ID" \
  --size Standard_D4s_v3 \
  --admin-username "$ADMIN_USER" \
  --ssh-key-values "$SSH_KEY" \
  --vnet-name spark-vnet --subnet spark-subnet \
  --nsg spark-nsg \
  --public-ip-sku Standard \
  --os-disk-size-gb 64

The Spark master RPC port 7077 is intentionally not in any NSG rule above — customers reach the cluster via spark-submit executed on the virtual machine itself, not across the network. Add a 7077 allow rule only if you are deliberately running Spark drivers on another host and you have first enabled Spark authentication per Section 14.
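Once the deployment returns, you can look up the address needed for the SSH step in Section 3. A representative query, assuming the variables from the invocation above are still set in your shell:

```shell
# Retrieve the public IP of the new VM (-d expands instance view fields)
VM_IP="$(az vm show -d --resource-group "$RG" --name "$VM_NAME" \
  --query publicIps -o tsv)"
echo "Connect with: ssh ${ADMIN_USER}@${VM_IP}"
```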

Step 3: Connect via SSH

After deployment, find the public or private IP of the new virtual machine. From your management host:

ssh sparkops@<vm-ip>

The first login may take a few seconds while cloud-init finalises. Once you have a shell, the Spark master and worker services have already been started by systemd, and the first-boot oneshot has already run apt-get update and apt-get upgrade in the background.

Step 4: Switch to the spark user and Source the Environment

The master and worker daemons run as the dedicated spark system user. For interactive work (spark-shell, spark-submit, pyspark) switch into that account and source the convenience shim:

sudo -iu spark
source ~/setEnv.sh
echo "$SPARK_HOME"

Expected output:

/opt/spark

The shim exports SPARK_HOME, JAVA_HOME, PYSPARK_PYTHON, and prepends $SPARK_HOME/bin and $SPARK_HOME/sbin to your PATH. Every subsequent step in this guide assumes you have sourced setEnv.sh in your current shell.
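For reference, the shim is equivalent in effect to the following fragment. This is an illustrative sketch, not the literal file contents, with paths taken from the layout in Section 8 and the Java path from Section 16:

```shell
# Illustrative equivalent of ~/setEnv.sh
export SPARK_HOME=/opt/spark
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PYSPARK_PYTHON=python3
export PATH="$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH"
```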

Step 5: Confirm the Standalone Cluster Is Running

The image ships with both master and worker enabled under systemd. After a fresh deployment they are already up. Verify:

sudo systemctl status spark-master.service
sudo systemctl status spark-worker.service
ss -tlnp | grep -E ':8080|:7077'

Expected output (abridged):

Active: active (running)
...
LISTEN 0 4096 [::ffff:127.0.0.1]:7077 *:*
LISTEN 0 1    *:8080                  *:*

The Spark master RPC port 7077 is intentionally bound to loopback only so the OSS image without authentication cannot be reached over the network. The Master Web UI on 8080 is bound to all interfaces; restrict it to trusted IPs via the network security group.

If either service is inactive, start them both with the supplied helper:

~/start_spark.sh

To stop the cluster cleanly use ~/stop_spark.sh (which stops the worker first, then the master).

Step 6: Open the Spark Master Web UI

The Spark Master Web UI is served on port 8080. From a browser on one of the trusted IP addresses you added to the network security group, open:

http://<vm-ip>:8080

You should see a page titled Spark Master at spark://localhost:7077 with one worker listed under Workers in the ALIVE state and Alive Workers: 1 above the table. If the page loads but no workers are listed, see Section 16.
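If you prefer a scriptable check from the virtual machine itself, the standalone master also serves a JSON summary alongside the HTML UI. A minimal probe, assuming the default 8080 binding; the exact field names can vary between Spark releases:

```shell
# Fetch the master's JSON status and pretty-print it; look for the
# "status" and worker entries in the output
curl -s http://localhost:8080/json/ | python3 -m json.tool
```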

Step 7: Server Components

The deployed image contains the following components:

Component       Version                                       Purpose
Apache Spark    4.1.0 (bin-hadoop3)                           Unified analytics engine, standalone master + worker
OpenJDK         17 (headless)                                 JVM runtime for every Spark process
Python          3.10                                          Interpreter for PySpark driver and workers
Ubuntu          22.04 LTS                                     Base operating system
systemd units   spark-master.service, spark-worker.service    Process supervision

The master and worker daemons run under the dedicated spark system user. Standalone mode means the Spark master itself is the cluster manager — there is no YARN, Kubernetes, or Mesos layer below it.
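You can confirm the component versions from a shell that has sourced setEnv.sh:

```shell
# Report the installed toolchain versions
spark-submit --version      # Spark 4.1.0, built for Hadoop 3
java -version               # OpenJDK 17
python3 --version           # Python 3.10.x
lsb_release -ds             # Ubuntu 22.04.x LTS
```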

Step 8: Filesystem Layout

Path                                       Owner         Mode   Purpose
/opt/spark-4.1.0-bin-hadoop3/              spark:spark   -      Unpacked Apache Spark tarball
/opt/spark                                 (symlink)     -      Stable upgrade path that points at the active version
/opt/spark/conf/spark-defaults.conf        spark:spark   0644   Standalone master URL, driver/executor memory, Kryo serializer
/opt/spark/conf/spark-env.sh               spark:spark   0755   JAVA_HOME, PYSPARK_PYTHON, master host and ports, log dir
/home/spark/setEnv.sh                      spark:spark   0755   Customer-facing environment shim
/home/spark/start_spark.sh                 spark:spark   0755   Helper to start master + worker
/home/spark/stop_spark.sh                  spark:spark   0755   Helper to stop worker + master
/etc/systemd/system/spark-master.service   root:root     0644   Master unit
/etc/systemd/system/spark-worker.service   root:root     0644   Worker unit
/var/log/spark/                            spark:spark   0750   Spark daemon and driver logs
/var/run/spark/                            spark:spark   0755   PID files (recreated at boot via tmpfiles.d)

The Azure ephemeral resource disk is mounted at /mnt by waagent. The root filesystem uses the default Ubuntu gallery LVM layout with separate /, /boot, /boot/efi, and /var partitions.

Step 9: Run Your First Job with spark-submit

The Spark distribution ships a built in SparkPi example that estimates pi using a Monte Carlo simulation across executors. Run it against the standalone cluster:

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://localhost:7077 \
  --conf spark.driver.host=localhost \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 10

Expected output includes a line of the form:

Pi is roughly 3.141...

Surrounding INFO log lines are normal Spark output and can be ignored. The accuracy of the estimate depends on the number of sample partitions passed as the final argument (10 in this example).

Open http://<vm-ip>:4040 while the job is running to see the per application Web UI with stage, task, and executor views. The UI is only served while a driver is active and tears down when the job exits.

Step 10: Run a PySpark Job

pyspark --master spark://localhost:7077 --conf spark.driver.host=localhost

Once the REPL is ready:

df = spark.range(1, 1000001)
print("sum:", df.selectExpr("sum(id) as s").collect()[0]["s"])

Expected output:

sum: 500000500000

Exit the shell with exit() or Ctrl-D.
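The same computation can also be run non-interactively through spark-submit. A sketch, using an arbitrary throwaway script path (/tmp/sum_ids.py is our choice here, not something shipped on the image):

```shell
# Write a small PySpark script, then submit it to the local standalone master
cat > /tmp/sum_ids.py <<'PY'
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sum-ids").getOrCreate()
df = spark.range(1, 1000001)
print("sum:", df.selectExpr("sum(id) AS s").collect()[0]["s"])
spark.stop()
PY

spark-submit --master spark://localhost:7077 \
  --conf spark.driver.host=localhost /tmp/sum_ids.py
```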

Step 11: Run a Spark SQL Query

Spark SQL is available both from spark-shell and through spark-sql directly:

spark-sql --master spark://localhost:7077 --conf spark.driver.host=localhost

Once the prompt is ready:

CREATE TABLE cloudimg_counter (n INT) USING PARQUET;
INSERT INTO cloudimg_counter VALUES (1), (2), (3), (4), (5);
SELECT SUM(n) AS total FROM cloudimg_counter;

Expected output of the final query:

15

Drop the table when you are done: DROP TABLE cloudimg_counter;. By default Spark SQL stores table data in spark-warehouse/ under the current working directory.
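For scripted use, spark-sql also accepts a query on the command line via -e, which is convenient for one-off checks without entering the prompt. Assuming the cloudimg_counter table from above still exists:

```shell
# Run a single query non-interactively and exit
spark-sql --master spark://localhost:7077 \
  --conf spark.driver.host=localhost \
  -e "SELECT SUM(n) AS total FROM cloudimg_counter;"
```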

Step 12: Start, Stop, and Check Status

The master and worker services are started by systemd at boot. Manage them as follows:

# Status
sudo systemctl status spark-master.service
sudo systemctl status spark-worker.service

# Stop (worker first, then master)
sudo systemctl stop spark-worker.service
sudo systemctl stop spark-master.service

# Start (master first, then worker)
sudo systemctl start spark-master.service
sudo systemctl start spark-worker.service

# Tail live logs
sudo journalctl -u spark-master.service -f
sudo journalctl -u spark-worker.service -f

On disk log files for the daemons are written to /var/log/spark/. The spark-worker.service unit declares Requires=spark-master.service, so stopping the master automatically stops the worker too.

Step 13: Attach a Data Disk for Real Workloads

For non-trivial datasets you should attach a Premium SSD data disk and point Spark at it for both input staging and shuffle spill.

# Stop the cluster
sudo systemctl stop spark-worker.service
sudo systemctl stop spark-master.service

# (In Azure) attach a new Premium SSD via the portal or `az vm disk attach`,
# then on the VM identify the new device. It will typically be /dev/sdc.
lsblk

# Format and mount
sudo mkfs.xfs /dev/sdc
sudo mkdir -p /data
echo "/dev/sdc /data xfs defaults,nofail 0 2" | sudo tee -a /etc/fstab
sudo mount /data
sudo mkdir -p /data/spark-local
sudo chown spark:spark /data/spark-local
sudo chmod 0750 /data/spark-local

# Point Spark local storage at the new disk
sudo tee -a /opt/spark/conf/spark-defaults.conf >/dev/null <<'EOF'
spark.local.dir                  /data/spark-local
EOF

sudo systemctl start spark-master.service
sudo systemctl start spark-worker.service

spark.local.dir is where Spark writes shuffle blocks and cached RDD partitions when memory pressure forces a spill. Moving it off the OS disk is the single highest value change for shuffle heavy workloads.
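Device names like /dev/sdc are not guaranteed to be stable across reboots on Azure. A more robust variant of the fstab step above keys the entry on the filesystem UUID instead (same mount point and options; use this in place of the /dev/sdc fstab line, not in addition to it):

```shell
# Look up the UUID of the freshly formatted filesystem and reference it in fstab
UUID="$(sudo blkid -s UUID -o value /dev/sdc)"
echo "UUID=${UUID} /data xfs defaults,nofail 0 2" | sudo tee -a /etc/fstab
sudo mount /data
```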

Step 14: Enable Spark Authentication Before Production

The shipped image has Spark authentication disabled. Any client that can reach port 7077 can submit arbitrary jobs. Do not expose the master RPC port beyond the virtual machine itself until you have completed this step.

Spark standalone authentication uses a shared secret that every cluster participant must present:

# Generate a shared secret
SECRET="$(openssl rand -hex 32)"

# Append to spark-defaults.conf
sudo tee -a /opt/spark/conf/spark-defaults.conf >/dev/null <<EOF
spark.authenticate               true
spark.authenticate.secret        ${SECRET}
EOF
sudo chown spark:spark /opt/spark/conf/spark-defaults.conf
sudo chmod 0640 /opt/spark/conf/spark-defaults.conf

# Restart
sudo systemctl restart spark-master.service
sudo systemctl restart spark-worker.service

# Every spark-submit, pyspark, spark-shell, spark-sql invocation must now
# include --conf spark.authenticate=true --conf spark.authenticate.secret=${SECRET}

Store the secret somewhere safe. Any job that does not present it will be rejected at handshake time. The Apache Spark security documentation at https://spark.apache.org/docs/latest/security.html is the authoritative reference for the full set of options including TLS encryption, event log protection, and RPC SASL.
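With authentication enabled, the SparkPi run from Section 9 becomes the following, assuming $SECRET still holds the generated value in your current shell:

```shell
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://localhost:7077 \
  --conf spark.driver.host=localhost \
  --conf spark.authenticate=true \
  --conf spark.authenticate.secret="${SECRET}" \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 10
```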

Step 15: Optional — Enable the Spark History Server

The shipped image does not run the Spark History Server (port 18080) by default, to keep the attack surface small. To enable it after deployment:

# Create an event log directory
sudo mkdir -p /var/log/spark/events
sudo chown spark:spark /var/log/spark/events
sudo chmod 0755 /var/log/spark/events

# Turn on event logging
sudo tee -a /opt/spark/conf/spark-defaults.conf >/dev/null <<'EOF'
spark.eventLog.enabled           true
spark.eventLog.dir               file:/var/log/spark/events
spark.history.fs.logDirectory    file:/var/log/spark/events
EOF

# Start the history server as the spark user
sudo -u spark /opt/spark/sbin/start-history-server.sh

Then open an NSG rule for TCP 18080 scoped to your trusted management IP addresses and visit http://<vm-ip>:18080. To persist across reboots, wrap start-history-server.sh in a systemd unit modelled on spark-master.service.
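A sketch of such a unit follows. This file is not shipped on the image; the name spark-history.service and the directives are illustrative, modelled on the master unit described in Section 8:

```ini
# /etc/systemd/system/spark-history.service -- illustrative, not shipped
[Unit]
Description=Apache Spark History Server
After=network-online.target

[Service]
User=spark
Environment=SPARK_HOME=/opt/spark
ExecStart=/opt/spark/sbin/start-history-server.sh
ExecStop=/opt/spark/sbin/stop-history-server.sh
Type=forking
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After writing the unit, run sudo systemctl daemon-reload and sudo systemctl enable --now spark-history.service to start it and persist it across reboots.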

Step 16: Troubleshooting

Master Web UI on port 8080 does not load. Check sudo systemctl status spark-master.service. Check ss -tlnp | grep :8080 on the virtual machine — the master should be listening on *:8080. Then confirm your network security group actually allows your client IP on TCP 8080 inbound, and that your public IP is the one in the allow rule (not a stale one).

Worker is not registering with the master. Look at /var/log/spark/spark-spark-org.apache.spark.deploy.worker.Worker-1-<hostname>.out or sudo journalctl -u spark-worker.service -n 200 --no-pager. The usual cause is a mismatched master URL — the worker unit uses spark://localhost:7077, and the master must be running on the same host.

spark-submit fails with "ClassNotFoundException: org.apache.spark.examples.SparkPi". The examples jar path expands with a glob so pin it to the actual version:

ls $SPARK_HOME/examples/jars/spark-examples*.jar

Use the full path printed above. If the file is missing, the install step was interrupted and you should redeploy the image rather than patching in place.

Java version mismatch. Spark 4.1 requires Java 17 or later. Run java -version — it should report 17.x. If not, sudo update-alternatives --config java and select /usr/lib/jvm/java-17-openjdk-amd64/bin/java.

Out of memory during spark-submit. The default driver is 1g and executor is 2g (see spark-defaults.conf). On Standard_D4s_v3 (16 GiB) you can raise spark.driver.memory to 4g and spark.executor.memory to 8g. For larger memory requirements move to Standard_D8s_v3 or a memory optimised family.
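Raising the limits does not require editing spark-defaults.conf for a single job; the same settings can be passed per invocation. For example, re-running SparkPi with more memory:

```shell
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://localhost:7077 \
  --conf spark.driver.host=localhost \
  --driver-memory 4g \
  --executor-memory 8g \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100
```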

Step 17: Security Recommendations

  • Enable Spark authentication immediately after first boot if port 7077 will be reachable by anything other than the virtual machine itself (Section 14)
  • Restrict the network security group on TCP 8080 and TCP 4040 to the smallest possible source CIDR
  • Do not expose TCP 7077 beyond the virtual machine in standalone single node mode
  • Consider TLS encryption for RPC traffic per the Apache Spark security documentation at https://spark.apache.org/docs/latest/security.html when moving beyond single node deployments
  • Rotate the shared secret used for Spark authentication on a regular schedule
  • Take regular snapshots of the OS disk and any attached data disk using Azure Disk Snapshots
  • Subscribe to the Apache Spark announce mailing list and apply security patches as they are published

Step 18: Support and Licensing

Apache Spark is licensed under the Apache License 2.0. The full text is reproduced in /opt/spark/LICENSE on the deployed image. cloudimg distributes the unmodified upstream Apache Spark 4.1.0 (bin-hadoop3) binary tarball published by the Apache Software Foundation, with cloudimg authored configuration, systemd units, and first boot tooling layered on top.

For support with the cloudimg image itself contact support@cloudimg.co.uk or visit https://www.cloudimg.co.uk/support. For Apache Spark questions outside the scope of cloudimg packaging, the Apache Spark community resources at https://spark.apache.org/community.html are the authoritative reference.