Hadoop Big Data Stack User Guide
Overview
The Hadoop Big Data Stack AMI by cloudimg provides a fully preconfigured Apache Hadoop ecosystem ready to run on Amazon EC2. This single node deployment includes Hadoop, Hive, HBase, Pig, Spark, and Zookeeper, giving you a complete big data processing and analytics platform that is operational within minutes of launch.
This AMI is ideal for development, testing, prototyping, and learning environments where you need the full Hadoop stack without the complexity of configuring each component from scratch. All components are installed under a dedicated /apps mount point with their own volumes, and the Hadoop services are managed through convenient start and stop scripts.
This guide walks you through connecting to the instance, initialising the Hadoop services, accessing the web management interfaces, and working with each component in the stack. Whether you are running MapReduce jobs, querying data with Hive, or processing streams with Spark, this guide provides everything you need to get started.
Visit www.cloudimg.co.uk to explore the full catalogue of preconfigured AMIs available on the AWS Marketplace.
Prerequisites
Before launching the Hadoop Big Data Stack AMI, ensure you have the following in place.
AWS Account You need an active AWS account with permissions to launch EC2 instances and manage security groups.
EC2 Key Pair Create or select an existing EC2 key pair in the region where you plan to launch the instance. This key pair is required for SSH authentication.
Security Group Configuration Your security group must allow inbound traffic on the following ports:
| Service | Protocol | Port | Description |
|---|---|---|---|
| SSH | TCP | 22 | SSH connectivity for remote administration |
| ResourceManager | TCP | 8088 | Hadoop ResourceManager UI |
| NodeManager | TCP | 8042 | Hadoop NodeManager UI |
For additional Hadoop ecosystem web interfaces, you may also want to open:
| Service | Protocol | Port | Description |
|---|---|---|---|
| HDFS NameNode | TCP | 9870 | HDFS NameNode UI |
| HBase Master | TCP | 16010 | HBase Master UI |
| Spark Master | TCP | 8080 | Spark Master UI (if configured) |
| Zookeeper | TCP | 2181 | Zookeeper client port |
Restrict all ports to your own IP address or a trusted CIDR range. These management interfaces should never be exposed to the public internet without additional authentication.
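If you manage security groups from the command line, the rules above can be added with the AWS CLI. This is a sketch only: the security group ID and CIDR below are placeholders you must replace with your own values.

```shell
# Placeholder values -- substitute your own security group ID and source CIDR.
SG_ID=sg-0123456789abcdef0
MY_CIDR=203.0.113.10/32

# Open the core ports (SSH, ResourceManager UI, NodeManager UI) to one CIDR.
for PORT in 22 8088 8042; do
  aws ec2 authorize-security-group-ingress \
    --group-id "$SG_ID" \
    --protocol tcp \
    --port "$PORT" \
    --cidr "$MY_CIDR"
done
```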
Minimum System Requirements
| Minimum CPU | Minimum RAM | Required Disk Space |
|---|---|---|
| 1 vCPU | 1 GB | 20 GB |
While the minimum requirements are low, big data workloads benefit significantly from more resources. For anything beyond basic testing, consider an m5.xlarge or larger instance type with at least 16 GB of RAM.
Step by Step Setup
Step 1: Launch the Instance
- Open the AWS Marketplace listing for Hadoop Big Data Stack by cloudimg.
- Click Continue to Subscribe, then Continue to Configuration.
- Select your preferred AWS Region and instance type.
- On the launch page, choose your VPC, subnet, and assign the security group you prepared above.
- Select your EC2 key pair and launch the instance.
Step 2: Wait for Status Checks
Allow the EC2 instance to reach 2/2 status checks passed before attempting to connect. The instance runs an initial boot update script that applies the latest operating system patches, so the first boot may take a few minutes longer than usual.
If you attempt to connect before both status checks have passed, you may see errors such as:
Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
ec2-user@your-instance-ip's password:
This is expected behaviour during early boot. Wait for the status checks to complete and try again.
Step 3: Connect via SSH
Connect to the instance using your private key:
ssh -i /path/to/your-key.pem ec2-user@<PUBLIC_IP>
Replace <PUBLIC_IP> with the public IP address or public DNS name shown in the EC2 console.
Step 4: Switch to the Root User
Once connected as ec2-user, switch to the root user:
sudo su -
Step 5: Configure Hadoop SSH Keys
Before starting Hadoop services for the first time, you must run the SSH key setup script as the root user. Hadoop requires SSH key based communication between its daemons (NameNode, DataNode, ResourceManager, NodeManager) even on a single node deployment:
/stage/scripts/setup_ssh_hadoop_user.sh
This script configures the hadoop OS user with the necessary SSH keys for passwordless authentication between Hadoop components.
Step 6: Start All Hadoop Services
Switch to the hadoop user and start all services using the provided convenience script:
sudo su - hadoop
cd $HOME
. ./start-all.sh
This script starts HDFS (NameNode, DataNode), YARN (ResourceManager, NodeManager), and all supporting services.
Step 7: Verify the ResourceManager UI
Open a web browser and navigate to:
http://<PUBLIC_IP>:8088
You should see the Hadoop ResourceManager web interface showing the YARN cluster status, including node information, application statistics, and scheduler metrics.
Step 8: Verify the NodeManager UI
Navigate to the NodeManager interface:
http://<PUBLIC_IP>:8042
This page displays information about the local node including resource usage and container logs.
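If the UIs do not load in a browser, you can confirm the daemons are answering from the instance itself; both endpoints below belong to the standard YARN REST API:

```shell
# Run on the instance (or through an SSH tunnel).
# ResourceManager cluster summary:
curl -s http://localhost:8088/ws/v1/cluster/info
# NodeManager local node summary:
curl -s http://localhost:8042/ws/v1/node/info
```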
Server Components
The Hadoop Big Data Stack AMI includes the following preconfigured components. All big data components are installed under the /apps directory on a dedicated volume.
| Component | Version | Software Home | Description |
|---|---|---|---|
| Java | 1.8 | /apps/java | Java Runtime Environment required by all Hadoop ecosystem components |
| Hadoop | 3.3.4 | /apps/hadoop | Distributed storage (HDFS) and processing (MapReduce/YARN) framework |
| Apache Hive | 3.1.3 | /apps/apache-hive | Data warehouse infrastructure for querying data stored in HDFS |
| Apache HBase | 2.4.15 | /apps/apache-hbase | Distributed, scalable NoSQL database built on top of HDFS |
| Apache Pig | 0.17 | /apps/apache-pig | High level platform for creating MapReduce programs using Pig Latin |
| Apache Spark | 3.3.2 | /apps/apache-spark | Unified analytics engine for large scale data processing |
| Apache Zookeeper | 3.7.1 | /apps/apache-zookeeper | Centralised service for configuration, synchronisation, and naming |
Each component has its configuration files located in a conf subdirectory within its installation directory. Versions are subject to change on initial boot if the update script finds newer packages.
Filesystem Layout
The AMI uses dedicated mount points to separate the operating system from the big data components.
| Mount Point | Description |
|---|---|
| / | Root filesystem containing the operating system |
| /boot | Operating system kernel files |
| /apps | Big data components installation directory on a dedicated volume |
Key directories and their purposes:
| Path | Purpose |
|---|---|
| /apps/java | Java installation directory (JAVA_HOME) |
| /apps/hadoop | Hadoop installation including HDFS and YARN binaries |
| /apps/hadoop/etc/hadoop | Hadoop configuration files (core-site.xml, hdfs-site.xml, yarn-site.xml) |
| /apps/apache-hive | Hive installation directory |
| /apps/apache-hive/conf | Hive configuration files |
| /apps/apache-hbase | HBase installation directory |
| /apps/apache-hbase/conf | HBase configuration files |
| /apps/apache-pig | Pig installation directory |
| /apps/apache-pig/conf | Pig configuration files |
| /apps/apache-spark | Spark installation directory |
| /apps/apache-spark/conf | Spark configuration files |
| /apps/apache-zookeeper | Zookeeper installation directory |
| /apps/apache-zookeeper/conf | Zookeeper configuration files |
| /home/hadoop | Home directory for the hadoop OS user, contains start/stop scripts |
| /stage/scripts | cloudimg provisioning scripts and log files |
The /apps volume is mounted on a separate EBS volume, allowing you to independently resize storage for your big data components and data without affecting the root filesystem.
Managing Services
All Hadoop ecosystem services are managed through the hadoop OS user. The AMI provides convenient start and stop scripts located in the hadoop user's home directory.
Starting All Hadoop Services
sudo su - hadoop
cd $HOME
. ./start-all.sh
This starts HDFS, YARN, and all related daemons for single node operation.
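A quick way to confirm HDFS is healthy after startup is a small round trip through the filesystem. The path below is a throwaway example; run it as the hadoop user:

```shell
# Create a scratch directory, write a file, read it back, then clean up.
hdfs dfs -mkdir -p /tmp/smoke-test
echo "hello hdfs" | hdfs dfs -put - /tmp/smoke-test/hello.txt
hdfs dfs -cat /tmp/smoke-test/hello.txt
hdfs dfs -rm -r /tmp/smoke-test
```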
Stopping All Hadoop Services
sudo su - hadoop
cd $HOME
. ./stop-all.sh
Checking Individual Service Status
You can verify running Java processes (which includes all Hadoop daemons) using:
sudo su - hadoop
jps
The jps command should list processes such as NameNode, DataNode, ResourceManager, NodeManager, and any other active Hadoop daemons.
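To script this check rather than eyeball the jps listing, a simple loop over the expected daemon names works; the names below are the standard Hadoop 3.x process names for a single node deployment:

```shell
# Report whether each core Hadoop daemon appears in the jps output.
for DAEMON in NameNode DataNode ResourceManager NodeManager; do
  if jps | grep -q "$DAEMON"; then
    echo "$DAEMON: running"
  else
    echo "$DAEMON: NOT running"
  fi
done
```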
Verifying Java Version
java -version
Using System Components
Apache Hive
Apache Hive provides a SQL-like interface for querying data stored in HDFS. Verify the installed version:
hive --version
To start the Hive interactive shell:
sudo su - hadoop
hive
Configuration files are located at /apps/apache-hive/conf. You can customise Hive behaviour by editing hive-site.xml in that directory.
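As a quick sanity check of the Hive installation, a minimal HiveQL session might look like the following; the table name and data are illustrative only:

```sql
-- Create a throwaway table, load a row, aggregate, and clean up.
CREATE TABLE IF NOT EXISTS demo_logs (ts STRING, level STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

INSERT INTO demo_logs VALUES ('2024-01-01 00:00:00', 'INFO', 'startup complete');

SELECT level, COUNT(*) FROM demo_logs GROUP BY level;

DROP TABLE demo_logs;
```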
Apache HBase
Apache HBase is a distributed NoSQL database. Launch the HBase shell:
hbase shell
From the HBase shell you can create tables, insert data, and run scans. Configuration files are located at /apps/apache-hbase/conf.
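A minimal HBase shell session to exercise the installation might look like this; the table, column family, and values are throwaway examples:

```
create 'demo', 'cf'
put 'demo', 'row1', 'cf:greeting', 'hello hbase'
get 'demo', 'row1'
scan 'demo'
disable 'demo'
drop 'demo'
```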
Apache Pig
Apache Pig provides a high level language (Pig Latin) for expressing data analysis programs. Verify the installed version:
pig -version
To start the Pig interactive shell (Grunt):
sudo su - hadoop
pig
Configuration files are located at /apps/apache-pig/conf.
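A classic word count is a compact way to try Pig Latin from the Grunt shell; the input path below is illustrative and must point at a real file in HDFS:

```
-- Count word occurrences in a text file (illustrative input path).
lines  = LOAD '/tmp/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;
```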
Apache Spark
Apache Spark is a unified analytics engine for large scale data processing. Verify the installed version:
spark-submit --version
To launch the Spark interactive shell:
sudo su - hadoop
spark-shell
For PySpark (Python interface):
pyspark
Configuration files are located at /apps/apache-spark/conf.
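A non-interactive way to confirm Spark can run jobs is to submit the bundled SparkPi example. The jar path below assumes the standard Spark 3.3.2 examples layout under /apps/apache-spark; adjust the filename if the version on your instance differs:

```shell
# Estimate pi with 100 partitions using the stock Spark examples jar.
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  /apps/apache-spark/examples/jars/spark-examples_2.12-3.3.2.jar 100
```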
Apache Zookeeper
Apache Zookeeper provides distributed coordination services. Verify the installed version:
zkServer.sh version
To check the Zookeeper service status:
zkServer.sh status
Configuration files are located at /apps/apache-zookeeper/conf.
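One way to exercise Zookeeper end to end is to create, read, and delete a throwaway znode with the bundled CLI; the znode name below is illustrative:

```shell
# Feed a short command script to the Zookeeper CLI on the local client port.
zkCli.sh -server localhost:2181 <<'EOF'
create /demo "hello"
get /demo
delete /demo
quit
EOF
```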
Scripts and Log Files
The AMI includes several scripts and log files created by cloudimg to streamline provisioning and service management.
| Script or Log | Path | Description |
|---|---|---|
| initial_boot_update.sh | /stage/scripts | Updates the operating system with the latest available patches on first boot |
| initial_boot_update.log | /stage/scripts | Output log from the initial boot update script |
| setup_ssh_hadoop_user.sh | /stage/scripts | Configures SSH keys for the hadoop OS user (must be run as root before starting services) |
| start-all.sh | /home/hadoop | Starts all Hadoop services for single node operation |
| stop-all.sh | /home/hadoop | Stops all Hadoop services for single node operation |
Disabling the Initial Boot Update Script
The OS update script runs automatically on every reboot via crontab. If you prefer to manage updates manually, you can disable it:
rm -f /stage/scripts/initial_boot_update.sh
crontab -e
# DELETE THE BELOW LINE, SAVE AND EXIT THE FILE:
@reboot /stage/scripts/initial_boot_update.sh
Web Management Interfaces
The Hadoop ecosystem provides several web interfaces for monitoring and management. All URLs use the public or private IP of your EC2 instance.
| Interface | URL | Description |
|---|---|---|
| ResourceManager UI | http://PUBLIC_IP:8088 | YARN cluster overview, application status, node information, and scheduler metrics |
| NodeManager UI | http://PUBLIC_IP:8042 | Individual node details including resource usage and container logs |
After starting all Hadoop services with the start-all.sh script, these interfaces should be immediately accessible from your browser provided the corresponding ports are open in your security group.
Troubleshooting
Cannot connect via SSH
- Confirm the instance has reached 2/2 status checks in the EC2 console.
- Verify your security group allows inbound TCP traffic on port 22 from your IP address.
- Ensure you are using the correct key pair and connecting as ec2-user.
- Check that the key file permissions are set correctly: chmod 400 /path/to/your-key.pem.
Hadoop services fail to start
- Ensure you ran the SSH key setup script as root first: /stage/scripts/setup_ssh_hadoop_user.sh.
- Verify you are starting services as the hadoop user, not as root or ec2-user.
- Check that the start script is sourced correctly using . ./start-all.sh (note the leading dot and space).
- Review Hadoop logs in /apps/hadoop/logs/ for specific error messages.
ResourceManager or NodeManager UI not accessible
- Confirm Hadoop services are running by checking jps output as the hadoop user.
- Verify your security group allows inbound traffic on ports 8088 and 8042.
- Check that services are bound to the correct network interface and not just localhost.
Hive, HBase, or other component commands not found
- Ensure you are running as the hadoop user: sudo su - hadoop.
- Verify the PATH includes the relevant component bin directories.
- Check that the component installation exists under /apps/.
Out of memory errors during job execution
- This AMI supports a minimum of 1 GB RAM, but big data workloads typically need much more.
- Consider resizing the EC2 instance to a type with more memory (m5.xlarge or larger).
- Adjust YARN memory settings in /apps/hadoop/etc/hadoop/yarn-site.xml.
HDFS reports no space available
- Check disk usage with df -h and verify the /apps volume has available space.
- Consider resizing the EBS volume attached to /apps if more storage is needed.
- Clean up temporary files or old job outputs in HDFS using hdfs dfs -rm -r /path/to/old/data.
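Before deleting anything, it helps to see where the space is actually going; both commands below are standard HDFS tooling:

```shell
# Summarise overall HDFS capacity, usage, and DataNode status.
hdfs dfsadmin -report
# List the largest top-level HDFS directories, biggest first.
hdfs dfs -du -h / | sort -hr | head
```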
Security Recommendations
Restrict Web Interface Access
The Hadoop web interfaces (ports 8088 and 8042) have no built-in authentication by default. Open these ports only to trusted IP addresses in your security group, and never expose them to 0.0.0.0/0.
Use SSH Tunnelling for Web Interfaces
Instead of opening Hadoop ports directly, use SSH tunnelling to securely access the web UIs:
ssh -i /path/to/your-key.pem -L 8088:localhost:8088 -L 8042:localhost:8042 ec2-user@<PUBLIC_IP>
Then access the interfaces at http://localhost:8088 and http://localhost:8042 from your local browser.
Restrict SSH Access
Limit SSH (port 22) to specific trusted IP addresses. Consider using AWS Systems Manager Session Manager as an alternative to direct SSH access.
Apply Security Updates Regularly
Keep the operating system and all packages up to date:
sudo yum update -y
Protect the Hadoop User
The hadoop OS user owns all the big data components and services. Ensure that only authorised administrators can switch to this user. Review sudoers configuration to limit access.
Enable Hadoop Security Features
For production environments, consider enabling Kerberos authentication for the Hadoop cluster. This adds strong authentication between all Hadoop components and prevents unauthorised access to HDFS data.
Backup HDFS Data
Regularly back up important HDFS data. You can use distcp to copy data to Amazon S3:
hadoop distcp hdfs:///important-data s3a://your-backup-bucket/hadoop-backup/
Also consider using AWS EBS snapshots for volume level backups of the /apps partition.
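A volume level backup of /apps can be taken with the AWS CLI; the volume ID below is a placeholder you must replace (find the real one in the EC2 console or with aws ec2 describe-volumes):

```shell
# Placeholder volume ID -- substitute the EBS volume backing /apps.
aws ec2 create-snapshot \
  --volume-id vol-0123456789abcdef0 \
  --description "Hadoop /apps volume backup"
```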
Support
If you encounter any issues not covered in this guide or need further assistance, the cloudimg support team is available 24/7.
Email: support@cloudimg.co.uk
Phone: (+44) 02045382725
Website: www.cloudimg.co.uk
Address: 3rd Floor, 86-90 Paul Street, London, EC2A 4NE
When contacting support, please include your EC2 instance ID, the AWS region, and a description of the issue along with any relevant log output.