

Hands on Sqoop, Hive and Impala for Data Analysts

What you will learn

Overview of the Big Data ecosystem, including Hadoop HDFS, YARN, Map Reduce, Sqoop, Hive, etc

Overview of HDFS commands such as put (copyFromLocal), get (copyToLocal), and cat, along with concepts such as block size and replication factor

Managing Tables in Hive Metastore using DDL Commands

Load or Insert data into Hive Metastore Tables using commands such as LOAD and INSERT

Overview of Functions in Hive to manipulate strings, dates, etc

Writing Basic Hive QL Queries using WHERE, JOIN, GROUP BY, etc

Analytical or Windowing Functions in Hive

Overview of Impala and understanding similarities and differences between Hive and Impala

Getting Started with Sqoop by reviewing official documentation and also exploring commands such as Sqoop eval

Importing Data from RDBMS tables into HDFS using Sqoop Import

Importing Data from RDBMS tables into Hive tables using Sqoop Import

Exporting Data from Hive or HDFS to RDBMS tables using Sqoop Export

Incremental Imports using Sqoop Import into HDFS or Hive Tables

Description

As part of Sqoop, Hive, and Impala for Data Analysts (Formerly CCA 159), you will learn key skills such as Sqoop, Hive, and Impala.

This comprehensive course covers all aspects of the certification with real-world examples and data sets.

Overview of Big Data ecosystem

  • Overview Of Distributions and Management Tools
  • Properties and Properties Files – General Guidelines
  • Hadoop Distributed File System
  • YARN and Map Reduce 2
  • Submitting Map Reduce Job
  • Determining Number of Mappers and Reducers
  • Understanding YARN and Map Reduce Configuration Properties
  • Review and Override Job Properties
  • Reviewing Map Reduce Job Logs
  • Map Reduce Job Counters
  • Overview of Hive
  • Databases and Query Engines
  • Overview of Data Ingestion in Big Data
  • Data Processing using Spark

HDFS Commands to manage files

  • Introduction to HDFS for Certification Exams
  • Overview of HDFS and Properties Files
  • Overview of Hadoop CLI
  • Listing Files in HDFS
  • User Spaces or Home Directories in HDFS
  • Creating Directories in HDFS
  • Copying Files and Directories into HDFS
  • File and Directory Permissions Overview
  • Getting Files and Directories from HDFS
  • Previewing Text Files in HDFS
  • Copying or Moving Files and Directories within HDFS
  • Understanding Size of File System and Files
  • Overview of Block Size and Replication Factor
  • Getting File Metadata using hdfs fsck
  • Resources and Exercises
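The commands listed above map to the Hadoop CLI roughly as follows; the paths used here are illustrative, not the course's exact data sets:

```shell
# Copy a local file into HDFS (put and copyFromLocal are equivalent)
hdfs dfs -put orders.csv /user/training/retail/orders.csv

# List files and preview contents
hdfs dfs -ls /user/training/retail
hdfs dfs -cat /user/training/retail/orders.csv

# Copy back to the local file system (get / copyToLocal)
hdfs dfs -get /user/training/retail/orders.csv .

# Review block size, replication factor, and block locations
hdfs fsck /user/training/retail/orders.csv -files -blocks -locations
```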

Getting Started with Hive

  • Overview of Hive Language Manual
  • Launching and using Hive CLI
  • Overview of Hive Properties
  • Hive CLI History and hiverc
  • Running HDFS Commands in Hive CLI
  • Understanding Warehouse Directory
  • Creating and Using Hive Databases
  • Creating and Describing Hive Tables
  • Retrieve Metadata of Tables using DESCRIBE
  • Role of Hive Metastore Database
  • Overview of beeline
  • Running Hive Commands and Queries using beeline

Creating Tables in Hive using Hive QL

  • Creating Tables in Hive – orders
  • Overview of Basic Data Types in Hive
  • Adding Comments to Columns and Tables
  • Loading Data into Hive Tables from Local File System
  • Loading Data into Hive Tables from HDFS
  • Loading Data – Overwrite vs Append
  • Creating External tables in Hive
  • Specifying Location for Hive Tables
  • Difference between Managed Table and External Table
  • Default Delimiters in Hive Tables using Text File
  • Overview of File Formats in Hive
  • Differences between Hive and RDBMS
  • Truncate and Drop tables in Hive
  • Resources and Exercises
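A minimal sketch of the DDL and load commands covered here; the column names follow the retail orders data set used in the course, while the HDFS location is illustrative:

```sql
-- Managed table with explicit delimiters (text file format)
CREATE TABLE orders (
  order_id INT,
  order_date STRING,
  order_customer_id INT,
  order_status STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- External table: dropping it removes only metadata, not the data
CREATE EXTERNAL TABLE orders_ext LIKE orders
LOCATION '/user/training/retail/orders';

-- Load from the local file system (OVERWRITE replaces existing data)
LOAD DATA LOCAL INPATH '/data/retail_db/orders' OVERWRITE INTO TABLE orders;
```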

Loading/Inserting data into Hive tables using Hive QL

  • Introduction to Partitioning and Bucketing
  • Creating Tables using Orc Format – order_items
  • Inserting Data into Tables using Stage Tables
  • Load vs. Insert in Hive
  • Creating Partitioned Tables in Hive
  • Adding Partitions to Tables in Hive
  • Loading into Partitions in Hive Tables
  • Inserting Data Into Partitions in Hive Tables
  • Insert Using Dynamic Partition Mode
  • Creating Bucketed Tables in Hive
  • Inserting Data into Bucketed Tables
  • Bucketing with Sorting
  • Overview of ACID Transactions
  • Create Tables for Transactions
  • Inserting Individual Records into Hive Tables
  • Update and Delete Data in Hive Tables

Overview of functions in Hive

  • Overview of Functions
  • Validating Functions
  • String Manipulation – Case Conversion and Length
  • String Manipulation – substr and split
  • String Manipulation – Trimming and Padding Functions
  • String Manipulation – Reverse and Concatenating Multiple Strings
  • Date Manipulation – Current Date and Timestamp
  • Date Manipulation – Date Arithmetic
  • Date Manipulation – trunc
  • Date Manipulation – Using date format
  • Date Manipulation – Extract Functions
  • Date Manipulation – Dealing with Unix Timestamp
  • Overview of Numeric Functions
  • Data Type Conversion Using Cast
  • Handling Null Values
  • Query Example – Get Word Count
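A few of the functions from this section in one illustrative query against the `orders` table:

```sql
SELECT upper(order_status),                  -- case conversion
       substr(order_date, 1, 10),            -- substring
       date_format(current_date, 'yyyy-MM'), -- format a date
       date_add(current_date, -7),           -- date arithmetic
       cast('123' AS INT),                   -- type conversion
       nvl(order_status, 'UNKNOWN')          -- null handling
FROM orders
LIMIT 10;
```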

Writing Basic Queries in Hive

  • Overview of SQL or Hive QL
  • Execution Life Cycle of Hive Query
  • Reviewing Logs of Hive Queries
  • Projecting Data using Select and Overview of From
  • Derive Conditional Values using CASE and WHEN
  • Projecting Distinct Values
  • Filtering Data using Where Clause
  • Boolean Operations in Where Clause
  • Boolean OR vs IN Operator
  • Filtering Data using LIKE Operator
  • Performing Basic Aggregations using Aggregate Functions
  • Performing Aggregations using GROUP BY
  • Filtering Aggregated Data Using HAVING
  • Global Sorting using ORDER BY
  • Overview of DISTRIBUTE BY
  • Sorting Data within Groups using SORT BY
  • Using CLUSTER BY
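The clauses above come together in queries of roughly this shape; the status values and threshold are illustrative:

```sql
SELECT order_date,
       order_status,
       CASE WHEN order_status IN ('COMPLETE', 'CLOSED')
            THEN 'Done' ELSE 'Pending' END AS status_group,
       count(1) AS order_count
FROM orders
WHERE order_status IN ('COMPLETE', 'CLOSED', 'PENDING')
GROUP BY order_date, order_status
HAVING count(1) >= 100
ORDER BY order_date, order_count DESC;
```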

Joining Data Sets and Set Operations in Hive

  • Overview of Nested Sub Queries
  • Nested Sub Queries – Using IN Operator
  • Nested Sub Queries – Using EXISTS Operator
  • Overview of Joins in Hive
  • Performing Inner Joins using Hive
  • Performing Outer Joins using Hive
  • Performing Full Outer Joins using Hive
  • Map Side Join and Reduce Side Join in Hive
  • Joining in Hive using Legacy Syntax
  • Cross Joins in Hive
  • Overview of Set Operations in Hive
  • Perform Set Union between two Hive Query Results
  • Set Operations – Intersect and Minus Not Supported

Windowing or Analytics Functions in Hive

  • Prepare HR Database in Hive with Employees Table
  • Overview of Analytics or Windowing Functions in Hive
  • Performing Aggregations using Hive Queries
  • Create Tables to Get Daily Revenue using CTAS in Hive
  • Getting Lead and Lag using Windowing Functions in Hive
  • Getting First and Last Values using Windowing Functions in Hive
  • Applying Rank using Windowing Functions in Hive
  • Applying Dense Rank using Windowing Functions in Hive
  • Applying Row Number using Windowing Functions in Hive
  • Difference Between rank, dense_rank, and row_number in Hive
  • Understanding the order of execution of Hive Queries
  • Overview of Nested Sub Queries in Hive
  • Filtering Data on Top of Window Functions in Hive
  • Getting Top 5 Products by Revenue for Each Day using Windowing Functions in Hive – Recap
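The recap topic above — top 5 products by revenue per day — can be sketched like this, assuming a `daily_product_revenue` table of the kind created via CTAS earlier in the section:

```sql
SELECT * FROM (
  SELECT order_date,
         product_id,
         revenue,
         dense_rank() OVER (PARTITION BY order_date
                            ORDER BY revenue DESC) AS rnk
  FROM daily_product_revenue
) q
WHERE q.rnk <= 5;
```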

Running Queries using Impala

  • Introduction to Impala
  • Role of Impala Daemons
  • Impala State Store and Catalog Server
  • Overview of Impala Shell
  • Relationship between Hive and Impala
  • Overview of Creating Databases and Tables using Impala
  • Loading and Inserting Data into Tables using Impala
  • Running Queries using Impala Shell
  • Reviewing Logs of Impala Queries
  • Synching Hive and Impala – Using Invalidate Metadata
  • Running Scripts using Impala Shell
  • Assignment – Using NYSE Data
  • Assignment – Solution
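Typical impala-shell invocations for the topics above; the daemon host name and script file are illustrative:

```shell
# Launch impala-shell against an Impala daemon
impala-shell -i impalad-host:21000

# Run a single query or a script non-interactively
impala-shell -i impalad-host:21000 -q 'SELECT count(1) FROM orders'
impala-shell -i impalad-host:21000 -f daily_revenue.sql

# After creating tables in Hive, refresh Impala's view of the metastore
impala-shell -i impalad-host:21000 -q 'INVALIDATE METADATA'
```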

Getting Started with Sqoop




  • Introduction to Sqoop
  • Validate Source Database – MySQL
  • Review JDBC Jar to Connect to MySQL
  • Getting Help using Sqoop CLI
  • Overview of Sqoop User Guide
  • Validate Sqoop and MySQL Integration using Sqoop List Databases
  • Listing Tables in Database using Sqoop
  • Run Queries in MySQL using Sqoop Eval
  • Understanding Logs in Sqoop
  • Redirecting Sqoop Job Logs into Log Files
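The validation commands above look roughly like this; the MySQL host, database, and credentials are illustrative:

```shell
# Validate Sqoop and MySQL integration by listing databases and tables
sqoop list-databases \
  --connect jdbc:mysql://mysql-host:3306 \
  --username retail_user -P

sqoop list-tables \
  --connect jdbc:mysql://mysql-host:3306/retail_db \
  --username retail_user -P

# Run an arbitrary query on the source database
sqoop eval \
  --connect jdbc:mysql://mysql-host:3306/retail_db \
  --username retail_user -P \
  --query 'SELECT count(1) FROM orders'
```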

Importing data from MySQL to HDFS using Sqoop Import

  • Overview of Sqoop Import Command
  • Import Orders using --target-dir
  • Import Order Items using --warehouse-dir
  • Managing HDFS Directories
  • Sqoop Import Execution Flow
  • Reviewing Logs of Sqoop Import
  • Sqoop Import Specifying Number of Mappers
  • Review the Output Files generated by Sqoop Import
  • Sqoop Import Supported File Formats
  • Validating avro files using Avro Tools
  • Sqoop Import Using Compression
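An import combining several of the options above — mappers, file format, and compression; host, database, and target directory are illustrative:

```shell
sqoop import \
  --connect jdbc:mysql://mysql-host:3306/retail_db \
  --username retail_user -P \
  --table orders \
  --target-dir /user/training/retail/orders \
  --num-mappers 2 \
  --as-avrodatafile \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec
```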

Apache Sqoop – Importing Data into HDFS – Customizing

  • Introduction to customizing Sqoop Import
  • Sqoop Import by Specifying Columns
  • Sqoop import Using Boundary Query
  • Sqoop import while filtering Unnecessary Data
  • Sqoop Import Using Split By to distribute import using non default column
  • Importing Query Results using Sqoop Import
  • Dealing with tables with Composite Keys while using Sqoop Import
  • Dealing with tables with Non Numeric Key Fields while using Sqoop Import
  • Dealing with tables with No Key Fields while using Sqoop Import
  • Using autoreset-to-one-mapper to use only one mapper while importing data using Sqoop from tables with no key fields
  • Default Delimiters used by Sqoop Import for Text File Format
  • Specifying Delimiters for Sqoop Import using Text File Format
  • Dealing with Null Values using Sqoop Import
  • Import Multiple Tables from source database using Sqoop Import

Importing data from MySQL to Hive Tables using Sqoop Import

  • Quick Overview of Hive
  • Create Hive Database for Sqoop Import
  • Create Empty Hive Table for Sqoop Import
  • Import Data into Hive Table from source database table using Sqoop Import
  • Managing Hive Tables while importing data using Sqoop Import using Overwrite
  • Managing Hive Tables while importing data using Sqoop Import – Errors Out If Table Already Exists
  • Understanding Execution Flow of Sqoop Import into Hive tables
  • Review Files generated by Sqoop Import in Hive Tables
  • Sqoop Delimiters vs Hive Delimiters
  • Different File Formats supported by Sqoop Import while importing into Hive Tables
  • Sqoop Import all Tables into Hive from source database

Exporting Data from HDFS/Hive to MySQL using Sqoop Export

  • Introduction to Sqoop Export
  • Prepare Data for Sqoop Export
  • Create Table in MySQL for Sqoop Export
  • Perform Simple Sqoop Export from HDFS to MySQL table
  • Understanding Execution Flow of Sqoop Export
  • Specifying Number of Mappers for Sqoop Export
  • Troubleshooting the Issues related to Sqoop Export
  • Merging or Upserting Data using Sqoop Export – Overview
  • Quick Overview of MySQL – Upsert using Sqoop Export
  • Update Data using Update Key using Sqoop Export
  • Merging Data using allowinsert in Sqoop Export
  • Specifying Columns using Sqoop Export
  • Specifying Delimiters using Sqoop Export
  • Using Stage Table for Sqoop Export

Submitting Sqoop Jobs and Incremental Sqoop Imports

  • Introduction to Sqoop Jobs
  • Adding Password File for Sqoop Jobs
  • Creating Sqoop Job
  • Run Sqoop Job
  • Overview of Incremental Loads using Sqoop
  • Incremental Sqoop Import – Using Where
  • Incremental Sqoop Import – Using Append Mode
  • Incremental Sqoop Import – Create Table
  • Incremental Sqoop Import – Create Sqoop Job
  • Incremental Sqoop Import – Execute Job
  • Incremental Sqoop Import – Add Additional Data
  • Incremental Sqoop Import – Rerun Job
  • Incremental Sqoop Import – Using Last Modified

Here are the objectives for this course.

Provide Structure to the Data

Use Data Definition Language (DDL) statements to create or alter structures in the metastore for use by Hive and Impala.

  • Create tables using a variety of data types, delimiters, and file formats
  • Create new tables using existing tables to define the schema
  • Improve query performance by creating partitioned tables in the metastore
  • Alter tables to modify the existing schema
  • Create views in order to simplify queries

Data Analysis

Use Query Language (QL) statements in Hive and Impala to analyze data on the cluster.

  • Prepare reports using SELECT commands including unions and subqueries
  • Calculate aggregate statistics, such as sums and averages, during a query
  • Create queries against multiple data sources by using join commands
  • Transform the output format of queries by using built-in functions
  • Perform queries across a group of rows using windowing functions

Exercises are provided so that you get enough practice to become proficient with Sqoop and with writing queries using Hive and Impala.

All the demos are given on our state-of-the-art Big Data cluster. If you do not have a multi-node cluster of your own, you can sign up for our labs and practice Sqoop and Hive on our multi-node cluster.

Language: English

Content

Introduction

CCA 159 Certification Exam – Overview
Tools for preparation
Getting Details about the Exam
Signing up for the Exam

Using Cloudera QuickStart VM

Download and Install Virtual Box
Setup Cloudera QuickStart VM
Overview of Cloudera QuickStart VM
Overview of MySQL Databases
Setup NYSE Database in MySQL
Overview of HDFS and Setup Datasets
Overview of Hive and Create External Table
Validate Sqoop

Using ITVersity labs

Signing up for the labs
Connecting to the gateway node of the cluster
Overview of HDFS in the cluster
Using Hive in the cluster
Understanding MySQL in the cluster
Running Sqoop Commands in the cluster

Overview of Big Data ecosystem

Overview of Distributions and Management Tools such as Ambari
Properties and Properties Files of Big Data Tools – General Guidelines
Hadoop Distributed File System – Quick Overview
Distributed Computing using YARN and Map Reduce 2 – Quick Overview
Submitting Map Reduce Job in YARN Framework
Determining Number of Mappers and Reducers
Understanding YARN and Map Reduce Configuration Properties
Reviewing and Overriding Map Reduce Job Run Time Properties
Reviewing Map Reduce Job Logs – using Resource Manager and Job History Server UI
Map Reduce Job Counters
Overview of Hive
Databases in Big Data and Query Engines
Overview of Data Ingestion Tools in Big Data

Overview of HDFS Commands

Introduction to HDFS for Certification Exams
Overview of HDFS and Properties Files
Overview of “hadoop fs” or “hdfs dfs” command
Listing Files in HDFS
User Spaces or Home Directories in HDFS
Creating Directory in HDFS
Copying Files and Directories into HDFS
File and Directory Permissions Overview
Getting Files and Directories from HDFS
Previewing Text Files in HDFS – cat and tail
Copying or Moving Files from one HDFS location to other HDFS location
Understanding Size of the File System and Data Sets – using df and du
Overview of Block Size and Replication Factor
Getting metadata of files using “hdfs fsck”
Resources and Exercises

Apache Hive – Getting Started

Overview of Hive Language Manual
Launching and Using Hive CLI
Overview of Hive Properties – SET and .hiverc
Hive CLI History and .hiverc
Running HDFS Commands using Hive CLI
Understanding Warehouse Directory
Creating Database in Hive and Switching to the Database
Creating First Table in Hive and list the tables
Retrieve metadata of Hive Tables using DESCRIBE (EXTENDED and FORMATTED)
Role of Hive Metastore
Overview of beeline – Alternative to Hive CLI
Running Hive Queries using Beeline

Apache Hive – Managing Tables in Hive

Create tables in Hive – orders
Overview of Data Types in Hive
Adding Comments to Columns and Tables
Loading Data into Hive Tables from Local File System
Loading Data into Hive Tables from HDFS Location
Loading Data into Hive Tables – Overwrite vs. Append
Creating External Tables in Hive
Specifying Location for Hive Tables
Managed Tables vs. External Tables
Default Delimiters in Hive Tables using Text File Format
Overview of File Formats – STORED AS Clause
Differences between Hive and RDBMS
Truncating and Dropping tables in Hive
Resources and Exercises

Apache Hive – Managing Tables in Hive – Partitioning and Bucketing

Introduction to Partitioning and Bucketing in Hive
Creating Tables using orc File Format – order_items
Inserting Data into order_items using stage table
Can we use LOAD Command to get data into order_items with orc file format?
Creating Partitioned Tables in Hive – orders_part with order_month as key
Adding Partitions to Tables in Hive
Loading into Partitions in Hive Tables
Inserting Data into Partitions in Hive Tables
Inserting data into Partitioned Tables – Using dynamic partition mode
Creating Bucketed Tables – orders_buck and order_items_buck
Inserting Data Into Bucketed Tables
Bucketing with Sorting
Overview of ACID Transactions in Hive
Create Tables for ACID Transactions
Inserting individual records into Hive Tables
Updating and Deleting data in Hive Bucketed Tables

Apache Hive – Overview of Functions

Overview of Functions
Validating Functions
String Manipulation – Case Conversion and Length
String Manipulation – substr and split
String Manipulation – trimming and padding Functions
String Manipulation – Reverse and Concatenating multiple strings
Date Manipulation – Getting Current Date and Timestamp
Date Manipulation – Date Arithmetic
Date Manipulation – trunc
Date Manipulation – Extracting information using date_format
Date Manipulation – Extracting information using year, month, day etc
Date Manipulation – Dealing with Unix Timestamp
Overview of Numeric Functions
Type Cast Functions for Data Type Conversion
Handling null values using nvl
Query Example – Get Word Count

Apache Hive – Writing Basic Queries

Overview of SQL
Hive Query – Execution Life Cycle
Reviewing Logs for Hive Queries
Projecting Data using SELECT and Overview of FROM Clause
Using CASE and WHEN as part of SELECT Clause
Projecting DISTINCT Values
Filtering Data using WHERE Clause
Boolean Operations such as OR and AND using multiple fields
Boolean OR vs. IN
Filtering data using LIKE Operator
Basic Aggregations using Aggregate Functions
Performing basic aggregations such as SUM, MIN, MAX etc using GROUP BY
Filtering post aggregation using HAVING
Global Sorting using ORDER BY
Overview of DISTRIBUTE BY
Sorting Data within groups using DISTRIBUTE BY and SORT BY
Overview of CLUSTER BY

Apache Hive – Writing Basic Queries – Joins and Set Operations

Overview of Nested Sub Queries
Nested Sub Queries in WHERE Clause with IN or NOT IN
Nested Sub Queries in WHERE Clause with EXISTS or NOT EXISTS
Overview of Joins
Joining Multiple Tables in Hive
Outer Joins in Hive
Full Outer Joins in Hive
Map Side Join vs. Reduce Side Join
Joining using Legacy Syntax
Cartesian Product between two data sets
Overview of SET Operations
Perform Union between two Data Sets
Not Supported – Perform Intersection or Minus between two Data Sets

Apache Hive – Analytics or Windowing Functions

Prepare HR Database with employees table
Overview of Analytics Functions or Windowing Functions
Performing Aggregations
Create tables to get daily revenue and daily product revenue
Getting Lead and Lag using Windowing Functions – order by
Getting Lead and Lag using Windowing Functions – order by and partition by
Windowing Functions – Using first_value and last_value
Applying rank Function
Applying dense_rank Function
Applying row_number Function
Difference between rank, dense_rank and row_number
Understanding Order of Execution
Quick recap of Nested Sub Queries
Filtering data using fields derived using analytics or windowing functions

Running Queries using Impala

Introduction to Impala
Role of Impala Daemons
Impala State Store and Catalog Server
Overview of impala-shell
Relationship between Hive and Impala
Overview of Creating Databases and Tables using Impala
Loading and Inserting Data into Impala Tables
Running Queries using Impala Shell
Reviewing Logs of Impala Queries
Synching Hive Metadata with Impala – using INVALIDATE METADATA
Running Scripts using Impala Shell

Apache Sqoop – Getting Started

Introduction to Sqoop
Validate Source Database – MySQL
Review JDBC Jar file to connect to MySQL
Getting help of Sqoop using Command Line
Overview of Sqoop User Guide
Validate Sqoop and MySQL integration using “sqoop list-databases”
List tables in MySQL using “sqoop list-tables”
Run Queries in MySQL using “sqoop eval”
Understanding Logs in Sqoop
Redirecting Sqoop Logs into files

Apache Sqoop – Importing Data into HDFS

Overview of Sqoop Import Command
Perform Sqoop Import of orders – --table and --target-dir
Perform Sqoop import of order_items – --warehouse-dir
Sqoop Import – Managing HDFS Directories – append or overwrite or fail
Sqoop Import – Execution Flow
Reviewing logs of Sqoop Import
Sqoop Import – Specifying Number of Mappers
Review the Output Files
Sqoop Import – Supported File Formats
Validating avro Files using avro-tools
Sqoop Import – Using Compression

Apache Sqoop – Importing Data into HDFS – Customizing

Sqoop Import – Customizing – Introduction
Sqoop Import – Specifying Columns
Sqoop Import – Using boundary query
Sqoop Import – Filter unnecessary data
Sqoop Import – Using Split By
Sqoop Import – Importing Query Results
Sqoop Import – Dealing with Composite Keys
Sqoop Import – Dealing with Primary Key or Split By using Non Numeric Field
Sqoop Import – Dealing with Tables without Primary Key
Sqoop Import – Autoreset to One Mapper
Sqoop Import – Default Delimiters using Text File Format
Sqoop Import – Specifying Delimiters – Import NYSE Data with tab (\t) as delimiter
Sqoop Import – Dealing with NULL Values
Sqoop Import – import-all-tables

Apache Sqoop – Importing Data into Hive Tables

Sqoop Import – Importing Data into Hive tables – Overview
Quick Overview of Hive
Sqoop Import – Create Hive Database
Creating empty Hive Table using create-hive-table
Sqoop Import – Import orders table to Hive Database
Sqoop Import – Managing Table using Hive Import – Overwrite
Sqoop Import – Managing Table using Hive Import – Error out – create-hive-table
Sqoop Import – Understanding Execution Flow while importing into Hive Table
Sqoop Import – Review files in Hive Tables
Sqoop Delimiters vs. Hive Delimiters – Text Files
Sqoop Import – Hive File Formats
Sqoop Import all tables – Hive

Apache Sqoop – Exporting Data from HDFS to RDBMS

Introduction
Prepare data for Export
Creating Table in MySQL
Sqoop Export – Perform Simple Export – --table and --export-dir
Sqoop Export – Execution Flow
Sqoop Export – Specifying Number of Mappers
Sqoop Export – Troubleshooting the issues
Sqoop Export – Merging or Upserting Overview
Sqoop Export – Quick Overview of MySQL for Upsert
Sqoop Export – Using --update-mode – updateonly (default)
Sqoop Export – Using --update-mode – allowinsert
Sqoop Export – Specifying Columns
Sqoop Export – Specifying Delimiters
Sqoop Export – Using Stage Table

Apache Sqoop – Incremental Imports and Jobs

Overview of Sqoop Jobs
Adding Password File
Creating Sqoop Job
Running Sqoop Job
Overview of Incremental Imports
Incremental Import – Using where
Incremental Import – Append Mode
Incremental Import – Create training_orders_incr in retail_export
Incremental Import – Create Job
Incremental Import – Execute Job
Incremental Import – Add additional data (order_id > 30000)
Incremental Import – Rerun the job and validate results
Incremental Import – Using mode lastmodified