

Step-by-step instructions to set up a Hadoop and Spark cluster using the Cloudera Distribution of Hadoop (formerly CCA 131)

What you will learn

Learn Hadoop and Spark Administration using CDH

Provision a cluster on GCP (Google Cloud Platform) to set up a Hadoop and Spark cluster using CDH

Set up Ansible for server automation to handle the prerequisites for a Hadoop and Spark cluster using CDH

Set up an 8-node cluster from scratch using CDH

Understand the architecture of HDFS, YARN, Spark, Hive, Hue, and more

Description

Cloudera is one of the leading vendors of distributions for Hadoop and Spark. In this practical guide, you will learn the step-by-step process of setting up a Hadoop and Spark cluster using CDH.

Install – Demonstrate an understanding of the installation process for Cloudera Manager, CDH, and the ecosystem projects.

  • Set up a local CDH repository (see the sketch after this list)
  • Perform OS-level configuration for Hadoop installation
  • Install Cloudera Manager server and agents
  • Install CDH using Cloudera Manager
  • Add a new node to an existing cluster
  • Add a service using Cloudera Manager
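
As a taste of the local-repository step, here is a minimal sketch of hosting a Cloudera Manager repository over httpd on a RHEL/CentOS node; the web root, hostname (repo.example.com), and repository contents are placeholders and depend on the CM/CDH release you download.

    # Install a web server and repository tooling on the repo host
    sudo yum -y install httpd createrepo
    sudo systemctl start httpd
    sudo systemctl enable httpd

    # Copy the downloaded Cloudera Manager RPMs under the web root, then index them
    sudo mkdir -p /var/www/html/cloudera-repos/cm
    sudo createrepo /var/www/html/cloudera-repos/cm

    # On every node, create /etc/yum.repos.d/cloudera-manager.repo containing:
    #   [cloudera-manager]
    #   name=Cloudera Manager (local repository)
    #   baseurl=http://repo.example.com/cloudera-repos/cm/
    #   gpgcheck=0
    #   enabled=1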

Configure – Perform basic and advanced configuration needed to effectively administer a Hadoop cluster

  • Configure a service using Cloudera Manager
  • Create an HDFS user’s home directory (see the sketch after this list)
  • Configure NameNode HA
  • Configure ResourceManager HA
  • Configure proxy for Hiveserver2/Impala
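
For instance, creating an HDFS home directory for a new user boils down to a couple of commands run as the hdfs superuser (the username alice is just a placeholder):

    # Create the home directory and hand ownership to the user
    sudo -u hdfs hdfs dfs -mkdir -p /user/alice
    sudo -u hdfs hdfs dfs -chown alice:alice /user/alice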

Manage – Maintain and modify the cluster to support day-to-day operations in the enterprise

  • Rebalance the cluster
  • Set up alerting for excessive disk fill
  • Define and install a rack topology script (see the sketch after this list)
  • Install a new type of I/O compression library in the cluster
  • Revise YARN resource assignment based on user feedback
  • Commission/decommission a node
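
To give a flavour of two of these tasks, here is a minimal sketch of a rack topology script and of the balancer command; the subnet-to-rack mapping is purely illustrative, and in CDH rack assignments are usually managed through Cloudera Manager.

    #!/bin/bash
    # rack-topology.sh: the NameNode passes host names/IPs as arguments
    # and expects one rack id per line on stdout (mapping below is made up)
    for host in "$@"; do
      case "$host" in
        10.0.1.*) echo "/rack1" ;;
        10.0.2.*) echo "/rack2" ;;
        *)        echo "/default-rack" ;;
      esac
    done

    # Rebalance until DataNodes are within 10% of average utilization
    sudo -u hdfs hdfs balancer -threshold 10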

Secure – Enable relevant services and configure the cluster to meet goals defined by security policy; demonstrate knowledge of basic security practices

  • Configure HDFS ACLs
  • Install and configure Sentry
  • Configure Hue user authorization and authentication
  • Enable/configure log and query redaction
  • Create encrypted zones in HDFS (see the sketch after this list)
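
As a preview of the encrypted-zones topic, the command-line side looks roughly like this once a KMS is in place (the key name, directory, and user are placeholders):

    # Create an encryption key in the KMS
    hadoop key create mykey

    # Create an empty directory and turn it into an encryption zone
    sudo -u hdfs hdfs dfs -mkdir /secure
    sudo -u hdfs hdfs crypto -createZone -keyName mykey -path /secure

    # Verify that the zone exists
    sudo -u hdfs hdfs crypto -listZones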

Test – Benchmark the cluster operational metrics, test system configuration for operation and efficiency

  • Execute file system commands via HTTPFS (see the sketch after this list)
  • Efficiently copy data within a cluster/between clusters
  • Create/restore a snapshot of an HDFS directory
  • Get/set ACLs for a file or directory structure
  • Benchmark the cluster (I/O, CPU, network)
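
Several of these tasks are one-liners once the services are up; the sketch below assumes an HttpFS role listening on httpfs.example.com:14000 and uses placeholder paths, users, and cluster names.

    # List a directory through HttpFS (WebHDFS REST API)
    curl "http://httpfs.example.com:14000/webhdfs/v1/user/alice?op=LISTSTATUS&user.name=alice"

    # Allow snapshots on a directory, take one, and restore a file from it
    sudo -u hdfs hdfs dfsadmin -allowSnapshot /data
    hdfs dfs -createSnapshot /data snap1
    hdfs dfs -cp /data/.snapshot/snap1/report.csv /data/    # report.csv is hypothetical

    # Copy data between clusters with DistCp
    hadoop distcp hdfs://nn-a:8020/data hdfs://nn-b:8020/data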

Troubleshoot – Demonstrate ability to find the root cause of a problem, optimize inefficient execution, and resolve resource contention scenarios

  • Resolve errors/warnings in Cloudera Manager
  • Resolve performance problems/errors in cluster operation
  • Determine the reason for an application failure (see the sketch after this list)
  • Configure the Fair Scheduler to resolve application delays
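
For the application-failure item, a typical first pass from the command line looks like this (the application id shown is made up):

    # List recently failed applications
    yarn application -list -appStates FAILED

    # Pull the aggregated logs for a specific application
    yarn logs -applicationId application_1650000000000_0007 | less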

Our Approach

  • You will start by creating a Cloudera QuickStart VM (if you have a laptop with 16 GB of RAM and a quad-core CPU). This will help you get comfortable with Cloudera Manager.
  • You can sign up for GCP and get up to $300 in credit while the offer lasts. Credits are valid for up to a year.
  • You will then get a brief overview of GCP and provision 7 to 8 virtual machines using templates. You will also attach external disks to be configured for HDFS later.
  • Once the servers are provisioned, you will set up Ansible for server automation (see the sketch after this list).
  • You will set up a local repository for Cloudera Manager and the Cloudera Distribution of Hadoop using packages.
  • You will then set up Cloudera Manager with a custom database and install the Cloudera Distribution of Hadoop using the wizard that comes with Cloudera Manager.
  • As part of setting up the Cloudera Distribution of Hadoop, you will set up HDFS, learn HDFS commands, set up YARN, configure HDFS and YARN high availability, understand schedulers, set up Spark, transition to parcels, set up Hive and Impala, and set up HBase and Kafka.
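
As an illustration of the Ansible step, operating-system prerequisites such as lowering vm.swappiness can be pushed to every node with ad-hoc commands like the following (the inventory file hosts listing your cluster nodes is assumed):

    # Verify connectivity to all provisioned servers
    ansible all -i hosts -m ping

    # Set vm.swappiness=1 on every node, a common Hadoop prerequisite
    ansible all -i hosts -b -m sysctl -a "name=vm.swappiness value=1 state=present reload=yes"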
Language: English

Content

Introduction – CCA 131 Cloudera Certified Hadoop and Spark Administrator

Introduction to the course
CCA 131 – Administrator – Official Page
Understanding required skills for the certification
Understanding the environment provided while taking the exam
Signing up for the exam

Getting Started – Cloudera QuickStart VM – Overview

Introduction
Pre-requisites
Setting up VirtualBox
Cloudera QuickStart VM – Create
Cloudera QuickStart VM – Quick Tour

Getting Started – Provision instances from Google Cloud

Introduction
Setup Ubuntu using Windows Subsystem
Sign up for GCP
Create template for Big Data Server
Provision Servers for Big Data Cluster
Review Concepts
Setting up gcloud
Setup ansible on first server
Format JBOD
Cluster Topology

Getting Started – Setup local yum repository server – CDH

Introduction
Overview of yum
Setup httpd service
Setup local yum repository – Cloudera Manager
Setup local yum repository – Cloudera Distribution of Hadoop (CDH)
Copy repo files

Install CM and CDH – Setup CM, Install CDH and Setup Cloudera Management Service

Introduction
Setup Pre-requisites
Install Cloudera Manager
Licensing and Installation Options
Install CM and CDH on all nodes
CM Agents and CM Server
Setup Cloudera Management Service
Cloudera Management Service – Components

Install CM and CDH – Configure Zookeeper

Introduction
Learning Process
Setup Zookeeper
Review important properties
Zookeeper Concepts
Important Zookeeper Commands

Install CM and CDH – Configure HDFS and Understand Concepts

Introduction
Setup HDFS
Copy Data into HDFS
Copy Data into HDFS Contd
Components of HDFS
Components of HDFS Contd
Configuration files and Important Properties
Review Web UIs and log files
Checkpointing
Checkpointing Contd
Namenode Recovery Process
Configure Rack Awareness

Install CM and CDH – Important HDFS Commands

Introduction
Getting list of commands and help
Creating Directories and Changing Ownership
Managing Files and File Permissions – Deleting Files from HDFS
Managing Files and File Permissions – Copying Files Local File System and HDFS
Managing Files and File Permissions – Copying Files within HDFS
Managing Files and File Permissions – Previewing Data in HDFS
Managing Files and File Permissions – Changing File Permissions
Controlling Access using ACLs – Enable ACLs On Cluster
Controlling Access using ACLs – ACLs On Files
Controlling Access using ACLs – ACLs On Directories
Controlling Access using ACLs – Removing ACLs
Overriding Properties
HDFS usage commands and getting metadata
Creating Snapshots
Using CLI for administration

Install CM and CDH – Configure YARN + MRv2 and Understand Concepts

Introduction
Setup YARN + MR2
Run Simple Map Reduce Job
Components of YARN and MR2
Configuration files and Important Properties – Overview
Configuration files and Important Properties – Review YARN Properties
Configuration files and Important Properties – Review Map Reduce Properties
Configuration files and Important Properties – Running Jobs
Review Web UIs and log files
YARN and MR2 CLI
YARN Application Life Cycle
Map Reduce Job Execution Life Cycle

Install CM and CDH – Configuring HDFS and YARN HA

Introduction
High Availability – Overview
Configure HDFS Namenode HA
Review Properties – HDFS Namenode HA
HDFS Namenode HA – Quick Recap of HDFS typical Configuration
HDFS Namenode HA – Components
HDFS Namenode HA – Automatic failover
Configure YARN Resource Manager HA
Review – YARN Resource Manager HA
High Availability – Implications

Install CM and CDH – YARN Schedulers – FIFO, Fair, and Capacity

Introduction
Schedulers Overview
FIFO Scheduler
Introduction to Fair Scheduler
Configure Fair Scheduler – Configure Cluster with Fair Scheduler
Configure Fair Scheduler – Running Jobs Without Specifying Queue
Configure Fair Scheduler – Running Jobs Specifying Queue
Configure Fair Scheduler – Important Properties
Capacity Scheduler – Introduction
Capacity Scheduler – Configure using Cloudera Manager
Capacity Scheduler – Run Sample Jobs

Install Other Components – Spark Overview and Installation

Introduction
Setup and Validate Spark 1.6.x
Review Important Properties
Spark Execution Life Cycle
Convert Cluster to Parcels
Setup Spark 2.3.x
Run Spark Jobs – Spark 2.3.x

Install Other Components – Configuring Database Engines – Hive and Impala

Introduction
Setup Hive and Impala
Validating Hive and Impala
Components and Properties of Hive
Troubleshooting Hive Issues
Hive Commands and Queries
Different Query Engines
Components and Properties of Impala
Running Queries using Impala – Overview

Install Other Components – Configure Hadoop Ecosystem components

Introduction
Setup Oozie, Pig, Sqoop and Hue
Review Important Properties
Run Sample Oozie job
Run Pig Job
Validate Sqoop
Overview of Hue

Install Other Components – Install and Configure Kafka and HBase

Introduction
Kafka Overview
Setup Parcels and Add Kafka Service
Validate Kafka
Setting up HBase
Validate HBase

CCA 131 – Revision for the Exam – Install the Cluster

Introduction
Set up a local CDH Repository
Perform OS-level Configuration
Install Cloudera Manager Server and Agents
Install CDH using Cloudera Manager
Add a New Node to an Existing Cluster
Install – Add Host as Worker
Add a Service using Cloudera Manager

CCA 131 – Revision for the Exam – Configure the Cluster

Introduction
Configure a Service using Cloudera Manager
Create an HDFS user’s home directory
Configure NameNode HA
Configure ResourceManager HA
Configure proxy for HiveServer2/Impala – Install HA Proxy
Configure proxy for HiveServer2
Configure proxy for Impala

CCA 131 – Revision for the Exam – Manage the Cluster

Introduction
Rebalance the cluster
Set up alerting for excessive disk fill
Define and install a rack topology script
Add I/O Compression Library
YARN Resource Assignment
Commission/Decommission a node

CCA 131 – Revision for the Exam – Secure the Cluster

Introduction
Configure HDFS ACLs
Install and Configure Sentry
Configure Hue user authorization and authentication
Enable or Configure Log and Query Redaction
Create Encrypted Zones in HDFS – Enable Encryption
Create Encrypted Zones in HDFS – Create Encryption Keys and Zones

CCA 131 – Revision for the Exam – Test and Troubleshoot the Cluster

Introduction
Execute file system commands via HTTPFS
Efficiently copy data within a cluster
Efficiently copy data between clusters
Create/Restore a snapshot of an HDFS directory
Get/Set ACLs for a file or directory structure
Benchmark the cluster (I/O, CPU, network)
Resolve errors/warnings in Cloudera Manager
Resolve performance problems/errors in cluster operation