Hadoop Big Data Stack User Guide

Product: Hadoop Big Data Stack

Overview

The Hadoop Big Data Stack AMI by cloudimg provides a fully preconfigured Apache Hadoop ecosystem ready to run on Amazon EC2. This single node deployment includes Hadoop, Hive, HBase, Pig, Spark, and Zookeeper, giving you a complete big data processing and analytics platform that is operational within minutes of launch.

This AMI is ideal for development, testing, prototyping, and learning environments where you need the full Hadoop stack without the complexity of configuring each component from scratch. All components are installed under a dedicated /apps mount point with their own volumes, and the Hadoop services are managed through convenient start and stop scripts.

This guide walks you through connecting to the instance, initialising the Hadoop services, accessing the web management interfaces, and working with each component in the stack. Whether you are running MapReduce jobs, querying data with Hive, or processing streams with Spark, this guide provides everything you need to get started.

Visit www.cloudimg.co.uk to explore the full catalogue of preconfigured AMIs available on the AWS Marketplace.


Prerequisites

Before launching the Hadoop Big Data Stack AMI, ensure you have the following in place.

AWS Account You need an active AWS account with permissions to launch EC2 instances and manage security groups.

EC2 Key Pair Create or select an existing EC2 key pair in the region where you plan to launch the instance. This key pair is required for SSH authentication.

Security Group Configuration Your security group must allow inbound traffic on the following ports:

Service Protocol Port Description
SSH TCP 22 SSH connectivity for remote administration
ResourceManager TCP 8088 Hadoop ResourceManager UI
NodeManager TCP 8042 Hadoop NodeManager UI

For additional Hadoop ecosystem web interfaces, you may also want to open:

Service Protocol Port Description
HDFS NameNode TCP 9870 HDFS NameNode UI
HBase Master TCP 16010 HBase Master UI
Spark Master TCP 8080 Spark Master UI (if configured)
Zookeeper TCP 2181 Zookeeper client port

Restrict all ports to your own IP address or a trusted CIDR range. These management interfaces should never be exposed to the public internet without additional authentication.
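If you manage security groups from the command line, the rules above can be added with the AWS CLI. This is a minimal sketch: the security group ID (sg-0123456789abcdef0) and trusted address (203.0.113.10/32) are placeholders you must replace with your own values.

```shell
# Allow SSH and the two core Hadoop UIs from a single trusted address.
# Replace the group ID and CIDR with your own before running.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --ip-permissions \
    'IpProtocol=tcp,FromPort=22,ToPort=22,IpRanges=[{CidrIp=203.0.113.10/32,Description="SSH"}]' \
    'IpProtocol=tcp,FromPort=8088,ToPort=8088,IpRanges=[{CidrIp=203.0.113.10/32,Description="ResourceManager UI"}]' \
    'IpProtocol=tcp,FromPort=8042,ToPort=8042,IpRanges=[{CidrIp=203.0.113.10/32,Description="NodeManager UI"}]'
```

The same pattern extends to the optional ports (9870, 16010, 8080, 2181) if you choose to open them.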

Minimum System Requirements

Minimum CPU Minimum RAM Required Disk Space
1 vCPU 1 GB 20 GB

While the minimum requirements are low, big data workloads benefit significantly from more resources. For anything beyond basic testing, consider an m5.xlarge or larger instance type with at least 16 GB of RAM.


Step by Step Setup

Step 1: Launch the Instance

  1. Open the AWS Marketplace listing for Hadoop Big Data Stack by cloudimg.
  2. Click Continue to Subscribe, then Continue to Configuration.
  3. Select your preferred AWS Region and instance type.
  4. On the launch page, choose your VPC, subnet, and assign the security group you prepared above.
  5. Select your EC2 key pair and launch the instance.

Step 2: Wait for Status Checks

Allow the EC2 instance to reach 2/2 status checks passed before attempting to connect. The instance runs an initial boot update script that applies the latest operating system patches, so the first boot may take a few minutes longer than usual.

If you attempt to connect before both status checks have passed, you may see errors such as:

Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
ec2-user@your-instance-ip's password:

This is expected behaviour during early boot. Wait for the status checks to complete and try again.
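If you use the AWS CLI, you can wait for the status checks programmatically instead of polling the console. A sketch, assuming a placeholder instance ID:

```shell
# Block until the instance reports 2/2 status checks passed
# (replace i-0123456789abcdef0 with your instance ID).
aws ec2 wait instance-status-ok --instance-ids i-0123456789abcdef0

# Then look up the public IP to connect to:
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[0].Instances[0].PublicIpAddress' --output text
```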

Step 3: Connect via SSH

Connect to the instance using your private key:

ssh -i /path/to/your-key.pem ec2-user@<PUBLIC_IP>

Replace <PUBLIC_IP> with the public IP address or public DNS name shown in the EC2 console.

Step 4: Switch to the Root User

Once connected as ec2-user, switch to the root user:

sudo su -

Step 5: Configure Hadoop SSH Keys

Before starting Hadoop services for the first time, you must run the SSH key setup script as the root user. Hadoop requires SSH key based communication between its components (NameNode, DataNode, ResourceManager, NodeManager) even on a single node deployment:

/stage/scripts/setup_ssh_hadoop_user.sh

This script configures the hadoop OS user with the necessary SSH keys for passwordless authentication between Hadoop components.

Step 6: Start All Hadoop Services

Switch to the hadoop user and start all services using the provided convenience script:

sudo su - hadoop
cd $HOME
. ./start-all.sh

This script starts HDFS (NameNode, DataNode), YARN (ResourceManager, NodeManager), and all supporting services.
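Once the script finishes, a quick HDFS round trip confirms the filesystem is accepting requests. The path below is a throwaway example:

```shell
# As the hadoop user, confirm HDFS is up and writable
hdfs dfsadmin -report                               # lists live DataNodes and capacity
hdfs dfs -mkdir -p /tmp/smoke-test                  # create a scratch directory
echo "hello hadoop" | hdfs dfs -put - /tmp/smoke-test/hello.txt
hdfs dfs -cat /tmp/smoke-test/hello.txt             # read the file back
hdfs dfs -rm -r /tmp/smoke-test                     # clean up
```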

Step 7: Verify the ResourceManager UI

Open a web browser and navigate to:

http://<PUBLIC_IP>:8088

You should see the Hadoop ResourceManager web interface showing the YARN cluster status, including node information, application statistics, and scheduler metrics.

Step 8: Verify the NodeManager UI

Navigate to the NodeManager interface:

http://<PUBLIC_IP>:8042

This page displays information about the local node including resource usage and container logs.


Server Components

The Hadoop Big Data Stack AMI includes the following preconfigured components. All big data components are installed under the /apps directory on a dedicated volume.

Component Version Software Home Description
Java 1.8 /apps/java Java Runtime Environment required by all Hadoop ecosystem components
Hadoop 3.3.4 /apps/hadoop Distributed storage (HDFS) and processing (MapReduce/YARN) framework
Apache Hive 3.1.3 /apps/apache-hive Data warehouse infrastructure for querying data stored in HDFS
Apache HBase 2.4.15 /apps/apache-hbase Distributed, scalable NoSQL database built on top of HDFS
Apache Pig 0.17 /apps/apache-pig High level platform for creating MapReduce programs using Pig Latin
Apache Spark 3.3.2 /apps/apache-spark Unified analytics engine for large scale data processing
Apache Zookeeper 3.7.1 /apps/apache-zookeeper Centralised service for configuration, synchronisation, and naming

Each component has its configuration files located in a conf subdirectory within its installation directory. Versions are subject to change on initial boot if the update script finds newer packages.


Filesystem Layout

The AMI uses dedicated mount points to separate the operating system from the big data components.

Mount Point Description
/ Root filesystem containing the operating system
/boot Operating system kernel files
/apps Big data components installation directory on a dedicated volume

Key directories and their purposes:

Path Purpose
/apps/java Java installation directory (JAVA_HOME)
/apps/hadoop Hadoop installation including HDFS and YARN binaries
/apps/hadoop/etc/hadoop Hadoop configuration files (core-site.xml, hdfs-site.xml, yarn-site.xml)
/apps/apache-hive Hive installation directory
/apps/apache-hive/conf Hive configuration files
/apps/apache-hbase HBase installation directory
/apps/apache-hbase/conf HBase configuration files
/apps/apache-pig Pig installation directory
/apps/apache-pig/conf Pig configuration files
/apps/apache-spark Spark installation directory
/apps/apache-spark/conf Spark configuration files
/apps/apache-zookeeper Zookeeper installation directory
/apps/apache-zookeeper/conf Zookeeper configuration files
/home/hadoop Home directory for the hadoop OS user, contains start/stop scripts
/stage/scripts cloudimg provisioning scripts and log files

The /apps volume is mounted on a separate EBS volume, allowing you to independently resize storage for your big data components and data without affecting the root filesystem.
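If you later enlarge the /apps EBS volume, the partition and filesystem must be grown to match. This is a hedged sketch: the device name below is an assumption, so confirm yours with lsblk first, and use resize2fs instead of xfs_growfs if the volume is formatted ext4.

```shell
# After enlarging the EBS volume in the AWS console or CLI:
lsblk                                # identify the device backing /apps
sudo growpart /dev/nvme1n1 1         # extend partition 1 (device name assumed)
sudo xfs_growfs /apps                # grow an XFS filesystem; resize2fs for ext4
df -h /apps                          # verify the new capacity
```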


Managing Services

All Hadoop ecosystem services are managed through the hadoop OS user. The AMI provides convenient start and stop scripts located in the hadoop user's home directory.

Starting All Hadoop Services

sudo su - hadoop
cd $HOME
. ./start-all.sh

This starts HDFS, YARN, and all related daemons for single node operation.

Stopping All Hadoop Services

sudo su - hadoop
cd $HOME
. ./stop-all.sh

Checking Individual Service Status

You can verify running Java processes (which includes all Hadoop daemons) using:

sudo su - hadoop
jps

The jps command should list processes such as NameNode, DataNode, ResourceManager, NodeManager, and any other active Hadoop daemons.

Verifying Java Version

java -version

Using System Components

Apache Hive

Apache Hive provides a SQL like interface for querying data stored in HDFS. To verify the installed version:

hive --version

To start the Hive interactive shell:

sudo su - hadoop
hive

Configuration files are located at /apps/apache-hive/conf. You can customise Hive behaviour by editing hive-site.xml in that directory.
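As a quick smoke test, a few HiveQL statements can be run non-interactively with hive -e. The table name below is illustrative:

```shell
# Create, populate, query, and drop a throwaway table as the hadoop user
hive -e "
  CREATE TABLE IF NOT EXISTS demo_visits (page STRING, hits INT);
  INSERT INTO demo_visits VALUES ('home', 10), ('about', 3);
  SELECT page, hits FROM demo_visits ORDER BY hits DESC;
  DROP TABLE demo_visits;
"
```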

Apache HBase

Apache HBase is a distributed NoSQL database. To launch the HBase shell:

hbase shell

From the HBase shell you can create tables, insert data, and run scans. Configuration files are located at /apps/apache-hbase/conf.
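As a quick smoke test, a table round trip can be fed to the shell via a heredoc. The table and column family names are illustrative:

```shell
# Create a table, write and read a cell, then drop the table
hbase shell <<'EOF'
create 'demo', 'cf'
put 'demo', 'row1', 'cf:greeting', 'hello'
scan 'demo'
disable 'demo'
drop 'demo'
EOF
```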

Apache Pig

Apache Pig provides a high level language (Pig Latin) for expressing data analysis programs. To verify the installed version:

pig -version

To start the Pig interactive shell (Grunt):

sudo su - hadoop
pig

Configuration files are located at /apps/apache-pig/conf.
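A small Pig Latin job can be exercised in local mode (no HDFS involved) against a throwaway file. The file name and schema below are illustrative:

```shell
# Prepare a tiny input file on the local filesystem
printf 'alice,30\nbob,15\n' > /tmp/people.csv

# Run a filter in local mode and print the result
pig -x local -e "
  people = LOAD '/tmp/people.csv' USING PigStorage(',') AS (name:chararray, age:int);
  adults = FILTER people BY age >= 18;
  DUMP adults;
"
```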

Apache Spark

Apache Spark is a unified analytics engine for large scale data processing. To verify the installed version:

spark-submit --version

To launch the Spark interactive shell:

sudo su - hadoop
spark-shell

For PySpark (Python interface):

pyspark

Configuration files are located at /apps/apache-spark/conf.
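A quick way to exercise Spark end to end is the bundled SparkPi example. The examples jar path below assumes the standard Spark 3.3.2 layout under /apps/apache-spark; adjust the file name if it differs on your instance.

```shell
# Estimate pi with 10 partitions, running locally on 2 cores
spark-submit --master 'local[2]' \
  --class org.apache.spark.examples.SparkPi \
  /apps/apache-spark/examples/jars/spark-examples_2.12-3.3.2.jar 10
```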

Apache Zookeeper

Apache Zookeeper provides distributed coordination services. To verify the installed version:

zkServer.sh version

To check the Zookeeper service status:

zkServer.sh status

Configuration files are located at /apps/apache-zookeeper/conf.
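A quick znode round trip with the bundled CLI confirms Zookeeper is serving requests. The znode path is illustrative:

```shell
# Create, read, and delete a throwaway znode
zkCli.sh -server localhost:2181 <<'EOF'
create /demo "hello"
get /demo
delete /demo
quit
EOF
```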


Scripts and Log Files

The AMI includes several scripts and log files created by cloudimg to streamline provisioning and service management.

Script or Log Path Description
initial_boot_update.sh /stage/scripts Updates the operating system with the latest available patches on first boot
initial_boot_update.log /stage/scripts Output log from the initial boot update script
setup_ssh_hadoop_user.sh /stage/scripts Configures SSH keys for the hadoop OS user (must be run as root before starting services)
start-all.sh /home/hadoop Starts all Hadoop services for single node operation
stop-all.sh /home/hadoop Stops all Hadoop services for single node operation

Disabling the Initial Boot Update Script

The OS update script runs automatically on every reboot via crontab. If you prefer to manage updates manually, you can disable it:

rm -f /stage/scripts/initial_boot_update.sh

crontab -e
# DELETE THE BELOW LINE, SAVE AND EXIT THE FILE:
@reboot /stage/scripts/initial_boot_update.sh
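If you prefer a non-interactive approach, the @reboot entry can be filtered out of root's crontab in one line:

```shell
# Remove the @reboot entry without opening an editor (run as root)
crontab -l | grep -v 'initial_boot_update.sh' | crontab -
```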

Web Management Interfaces

The Hadoop ecosystem provides several web interfaces for monitoring and management. All URLs use the public or private IP of your EC2 instance.

Interface URL Description
ResourceManager UI http://PUBLIC_IP:8088 YARN cluster overview, application status, node information, and scheduler metrics
NodeManager UI http://PUBLIC_IP:8042 Individual node details including resource usage and container logs

After starting all Hadoop services with the start-all.sh script, these interfaces should be immediately accessible from your browser provided the corresponding ports are open in your security group.


Troubleshooting

Cannot connect via SSH

  • Confirm the instance has reached 2/2 status checks in the EC2 console.
  • Verify your security group allows inbound TCP traffic on port 22 from your IP address.
  • Ensure you are using the correct key pair and connecting as ec2-user.
  • Check that the key file permissions are set correctly: chmod 400 /path/to/your-key.pem.

Hadoop services fail to start

  • Ensure you ran the SSH key setup script as root first: /stage/scripts/setup_ssh_hadoop_user.sh.
  • Verify you are starting services as the hadoop user, not as root or ec2-user.
  • Check that the start script is sourced correctly using . ./start-all.sh (note the leading dot and space).
  • Review Hadoop logs in /apps/hadoop/logs/ for specific error messages.

ResourceManager or NodeManager UI not accessible

  • Confirm Hadoop services are running by checking jps output as the hadoop user.
  • Verify your security group allows inbound traffic on ports 8088 and 8042.
  • Check that services are bound to the correct network interface and not just localhost.

Hive, HBase, or other component commands not found

  • Ensure you are running as the hadoop user: sudo su - hadoop.
  • Verify the PATH includes the relevant component bin directories.
  • Check that the component installation exists under /apps/.

Out of memory errors during job execution

  • This AMI supports a minimum of 1 GB RAM, but big data workloads typically need much more.
  • Consider resizing the EC2 instance to a type with more memory (m5.xlarge or larger).
  • Adjust YARN memory settings in /apps/hadoop/etc/hadoop/yarn-site.xml.

HDFS reports no space available

  • Check disk usage with df -h and verify the /apps volume has available space.
  • Consider resizing the EBS volume attached to /apps if more storage is needed.
  • Clean up temporary files or old job outputs in HDFS using hdfs dfs -rm -r /path/to/old/data.

Security Recommendations

Restrict Web Interface Access

The Hadoop web interfaces (ports 8088, 8042) do not have built in authentication by default. Only open these ports to trusted IP addresses in your security group. Never expose them to 0.0.0.0/0.

Use SSH Tunnelling for Web Interfaces

Instead of opening Hadoop ports directly, use SSH tunnelling to securely access the web UIs:

ssh -i /path/to/your-key.pem -L 8088:localhost:8088 -L 8042:localhost:8042 ec2-user@<PUBLIC_IP>

Then access the interfaces at http://localhost:8088 and http://localhost:8042 from your local browser.

Restrict SSH Access

Limit SSH (port 22) to specific trusted IP addresses. Consider using AWS Systems Manager Session Manager as an alternative to direct SSH access.

Apply Security Updates Regularly

Keep the operating system and all packages up to date:

sudo yum update -y

Protect the Hadoop User

The hadoop OS user owns all the big data components and services. Ensure that only authorised administrators can switch to this user. Review sudoers configuration to limit access.

Enable Hadoop Security Features

For production environments, consider enabling Kerberos authentication for the Hadoop cluster. This adds strong authentication between all Hadoop components and prevents unauthorised access to HDFS data.

Backup HDFS Data

Regularly back up important HDFS data. You can use distcp to copy data to Amazon S3:

hadoop distcp hdfs:///important-data s3a://your-backup-bucket/hadoop-backup/

Also consider using AWS EBS snapshots for volume level backups of the /apps partition.


Support

If you encounter any issues not covered in this guide or need further assistance, the cloudimg support team is available 24/7.

Email: support@cloudimg.co.uk
Phone: (+44) 02045382725
Website: www.cloudimg.co.uk
Address: 3rd Floor, 86-90 Paul Street, London, EC2A 4NE

When contacting support, please include your EC2 instance ID, the AWS region, and a description of the issue along with any relevant log output.