

Hands on Sqoop, Hive and Impala for Data Analysts

What you will learn

Overview of the Big Data ecosystem, including Hadoop HDFS, YARN, Map Reduce, Sqoop, Hive, etc

Overview of HDFS commands such as put (copyFromLocal), get (copyToLocal), and cat, along with concepts such as block size and replication factor

Managing Tables in Hive Metastore using DDL Commands

Load or Insert data into Hive Metastore Tables using commands such as LOAD and INSERT

Overview of Functions in Hive to manipulate strings, dates, etc

Writing Basic Hive QL Queries using WHERE, JOIN, GROUP BY, etc

Analytical or Windowing Functions in Hive

Overview of Impala and understanding similarities and differences between Hive and Impala

Getting Started with Sqoop by reviewing official documentation and also exploring commands such as Sqoop eval

Importing Data from RDBMS tables into HDFS using Sqoop Import

Importing Data from RDBMS tables into Hive tables using Sqoop Import

Exporting Data from Hive or HDFS to RDBMS tables using Sqoop Export

Incremental Imports using Sqoop Import into HDFS or Hive Tables

Description

As part of Sqoop, Hive, and Impala for Data Analysts (Formerly CCA 159), you will learn key skills such as Sqoop, Hive, and Impala.

This comprehensive course covers all aspects of the certification with real-world examples and data sets.

Overview of Big Data ecosystem

  • Overview Of Distributions and Management Tools
  • Properties and Properties Files – General Guidelines
  • Hadoop Distributed File System
  • YARN and Map Reduce 2
  • Submitting Map Reduce Job
  • Determining Number of Mappers and Reducers
  • Understanding YARN and Map Reduce Configuration Properties
  • Review and Override Job Properties
  • Reviewing Map Reduce Job Logs
  • Map Reduce Job Counters
  • Overview of Hive
  • Databases and Query Engines
  • Overview of Data Ingestion in Big Data
  • Data Processing using Spark

HDFS Commands to manage files

  • Introduction to HDFS for Certification Exams
  • Overview of HDFS and Properties Files
  • Overview of Hadoop CLI
  • Listing Files in HDFS
  • User Spaces or Home Directories in HDFS
  • Creating Directories in HDFS
  • Copying Files and Directories into HDFS
  • File and Directory Permissions Overview
  • Getting Files and Directories from HDFS
  • Previewing Text Files in HDFS
  • Copying or Moving Files and Directories within HDFS
  • Understanding Size of File System and Files
  • Overview of Block Size and Replication Factor
  • Getting File Metadata using hdfs fsck
  • Resources and Exercises
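The commands listed above map to the Hadoop CLI roughly as follows; the paths used here are illustrative, not the course's exact data sets:

```shell
# Copy a local file into HDFS (put and copyFromLocal are equivalent)
hdfs dfs -put orders.csv /user/training/retail/orders.csv

# List files and preview contents
hdfs dfs -ls /user/training/retail
hdfs dfs -cat /user/training/retail/orders.csv

# Copy back to the local file system (get / copyToLocal)
hdfs dfs -get /user/training/retail/orders.csv .

# Review block size, replication factor, and block locations
hdfs fsck /user/training/retail/orders.csv -files -blocks -locations
```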

Getting Started with Hive

  • Overview of Hive Language Manual
  • Launching and using Hive CLI
  • Overview of Hive Properties
  • Hive CLI History and hiverc
  • Running HDFS Commands in Hive CLI
  • Understanding Warehouse Directory
  • Creating and Using Hive Databases
  • Creating and Describing Hive Tables
  • Retrieve Metadata of Tables using DESCRIBE
  • Role of Hive Metastore Database
  • Overview of beeline
  • Running Hive Commands and Queries using beeline

Creating Tables in Hive using Hive QL

  • Creating Tables in Hive – orders
  • Overview of Basic Data Types in Hive
  • Adding Comments to Columns and Tables
  • Loading Data into Hive Tables from Local File System
  • Loading Data into Hive Tables from HDFS
  • Loading Data – Overwrite vs Append
  • Creating External tables in Hive
  • Specifying Location for Hive Tables
  • Difference between Managed Table and External Table
  • Default Delimiters in Hive Tables using Text File
  • Overview of File Formats in Hive
  • Differences between Hive and RDBMS
  • Truncate and Drop tables in Hive
  • Resources and Exercises
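A minimal sketch of the DDL and load commands covered here; the column names follow the retail orders data set used in the course, while the HDFS location is illustrative:

```sql
-- Managed table with explicit delimiters (text file format)
CREATE TABLE orders (
  order_id INT,
  order_date STRING,
  order_customer_id INT,
  order_status STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- External table: dropping it removes only metadata, not the data
CREATE EXTERNAL TABLE orders_ext LIKE orders
LOCATION '/user/training/retail/orders';

-- Load from the local file system (OVERWRITE replaces existing data)
LOAD DATA LOCAL INPATH '/data/retail_db/orders' OVERWRITE INTO TABLE orders;
```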

Loading/Inserting data into Hive tables using Hive QL

  • Introduction to Partitioning and Bucketing
  • Creating Tables using Orc Format – order_items
  • Inserting Data into Tables using Stage Tables
  • Load vs. Insert in Hive
  • Creating Partitioned Tables in Hive
  • Adding Partitions to Tables in Hive
  • Loading into Partitions in Hive Tables
  • Inserting Data Into Partitions in Hive Tables
  • Insert Using Dynamic Partition Mode
  • Creating Bucketed Tables in Hive
  • Inserting Data into Bucketed Tables
  • Bucketing with Sorting
  • Overview of ACID Transactions
  • Create Tables for Transactions
  • Inserting Individual Records into Hive Tables
  • Update and Delete Data in Hive Tables

Overview of functions in Hive

  • Overview of Functions
  • Validating Functions
  • String Manipulation – Case Conversion and Length
  • String Manipulation – substr and split
  • String Manipulation – Trimming and Padding Functions
  • String Manipulation – Reverse and Concatenating Multiple Strings
  • Date Manipulation – Current Date and Timestamp
  • Date Manipulation – Date Arithmetic
  • Date Manipulation – trunc
  • Date Manipulation – Using date format
  • Date Manipulation – Extract Functions
  • Date Manipulation – Dealing with Unix Timestamp
  • Overview of Numeric Functions
  • Data Type Conversion Using Cast
  • Handling Null Values
  • Query Example – Get Word Count
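A few of the functions from this section in one illustrative query against the `orders` table:

```sql
SELECT upper(order_status),                  -- case conversion
       substr(order_date, 1, 10),            -- substring
       date_format(current_date, 'yyyy-MM'), -- format a date
       date_add(current_date, -7),           -- date arithmetic
       cast('123' AS INT),                   -- type conversion
       nvl(order_status, 'UNKNOWN')          -- null handling
FROM orders
LIMIT 10;
```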

Writing Basic Queries in Hive

  • Overview of SQL or Hive QL
  • Execution Life Cycle of Hive Query
  • Reviewing Logs of Hive Queries
  • Projecting Data using Select and Overview of From
  • Derive Conditional Values using CASE and WHEN
  • Projecting Distinct Values
  • Filtering Data using Where Clause
  • Boolean Operations in Where Clause
  • Boolean OR vs IN Operator
  • Filtering Data using LIKE Operator
  • Performing Basic Aggregations using Aggregate Functions
  • Performing Aggregations using GROUP BY
  • Filtering Aggregated Data Using HAVING
  • Global Sorting using ORDER BY
  • Overview of DISTRIBUTE BY
  • Sorting Data within Groups using SORT BY
  • Using CLUSTER BY
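The clauses above come together in queries of roughly this shape; the status values and threshold are illustrative:

```sql
SELECT order_date,
       order_status,
       CASE WHEN order_status IN ('COMPLETE', 'CLOSED')
            THEN 'Done' ELSE 'Pending' END AS status_group,
       count(1) AS order_count
FROM orders
WHERE order_status IN ('COMPLETE', 'CLOSED', 'PENDING')
GROUP BY order_date, order_status
HAVING count(1) >= 100
ORDER BY order_date, order_count DESC;
```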

Joining Data Sets and Set Operations in Hive

  • Overview of Nested Sub Queries
  • Nested Sub Queries – Using IN Operator
  • Nested Sub Queries – Using EXISTS Operator
  • Overview of Joins in Hive
  • Performing Inner Joins using Hive
  • Performing Outer Joins using Hive
  • Performing Full Outer Joins using Hive
  • Map Side Join and Reduce Side Join in Hive
  • Joining in Hive using Legacy Syntax
  • Cross Joins in Hive
  • Overview of Set Operations in Hive
  • Perform Set Union between two Hive Query Results
  • Set Operations – Intersect and Minus Not Supported

Windowing or Analytics Functions in Hive

  • Prepare HR Database in Hive with Employees Table
  • Overview of Analytics or Windowing Functions in Hive
  • Performing Aggregations using Hive Queries
  • Create Tables to Get Daily Revenue using CTAS in Hive
  • Getting Lead and Lag using Windowing Functions in Hive
  • Getting First and Last Values using Windowing Functions in Hive
  • Applying Rank using Windowing Functions in Hive
  • Applying Dense Rank using Windowing Functions in Hive
  • Applying Row Number using Windowing Functions in Hive
  • Difference Between rank, dense_rank, and row_number in Hive
  • Understanding the order of execution of Hive Queries
  • Overview of Nested Sub Queries in Hive
  • Filtering Data on Top of Window Functions in Hive
  • Getting Top 5 Products by Revenue for Each Day using Windowing Functions in Hive – Recap
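The recap topic above — top 5 products by revenue per day — can be sketched like this, assuming a `daily_product_revenue` table of the kind created via CTAS earlier in the section:

```sql
SELECT * FROM (
  SELECT order_date,
         product_id,
         revenue,
         dense_rank() OVER (PARTITION BY order_date
                            ORDER BY revenue DESC) AS rnk
  FROM daily_product_revenue
) q
WHERE q.rnk <= 5;
```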

Running Queries using Impala

  • Introduction to Impala
  • Role of Impala Daemons
  • Impala State Store and Catalog Server
  • Overview of Impala Shell
  • Relationship between Hive and Impala
  • Overview of Creating Databases and Tables using Impala
  • Loading and Inserting Data into Tables using Impala
  • Running Queries using Impala Shell
  • Reviewing Logs of Impala Queries
  • Synching Hive and Impala – Using Invalidate Metadata
  • Running Scripts using Impala Shell
  • Assignment – Using NYSE Data
  • Assignment – Solution
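Typical impala-shell invocations for the topics above; the daemon host name and script file are illustrative:

```shell
# Launch impala-shell against an Impala daemon
impala-shell -i impalad-host:21000

# Run a single query or a script non-interactively
impala-shell -i impalad-host:21000 -q 'SELECT count(1) FROM orders'
impala-shell -i impalad-host:21000 -f daily_revenue.sql

# After creating tables in Hive, refresh Impala's view of the metastore
impala-shell -i impalad-host:21000 -q 'INVALIDATE METADATA'
```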

Getting Started with Sqoop




  • Introduction to Sqoop
  • Validate Source Database – MySQL
  • Review JDBC Jar to Connect to MySQL
  • Getting Help using Sqoop CLI
  • Overview of Sqoop User Guide
  • Validate Sqoop and MySQL Integration using Sqoop List Databases
  • Listing Tables in Database using Sqoop
  • Run Queries in MySQL using Sqoop Eval
  • Understanding Logs in Sqoop
  • Redirecting Sqoop Job Logs into Log Files
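The validation commands above look roughly like this; the MySQL host, database, and credentials are illustrative:

```shell
# Validate Sqoop and MySQL integration by listing databases and tables
sqoop list-databases \
  --connect jdbc:mysql://mysql-host:3306 \
  --username retail_user -P

sqoop list-tables \
  --connect jdbc:mysql://mysql-host:3306/retail_db \
  --username retail_user -P

# Run an arbitrary query on the source database
sqoop eval \
  --connect jdbc:mysql://mysql-host:3306/retail_db \
  --username retail_user -P \
  --query 'SELECT count(1) FROM orders'
```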

Importing data from MySQL to HDFS using Sqoop Import

  • Overview of Sqoop Import Command
  • Import Orders using --target-dir
  • Import Order Items using --warehouse-dir
  • Managing HDFS Directories
  • Sqoop Import Execution Flow
  • Reviewing Logs of Sqoop Import
  • Sqoop Import Specifying Number of Mappers
  • Review the Output Files generated by Sqoop Import
  • Sqoop Import Supported File Formats
  • Validating avro files using Avro Tools
  • Sqoop Import Using Compression
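An import combining several of the options above — mappers, file format, and compression; host, database, and target directory are illustrative:

```shell
sqoop import \
  --connect jdbc:mysql://mysql-host:3306/retail_db \
  --username retail_user -P \
  --table orders \
  --target-dir /user/training/retail/orders \
  --num-mappers 2 \
  --as-avrodatafile \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec
```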

Apache Sqoop – Importing Data into HDFS – Customizing

  • Introduction to customizing Sqoop Import
  • Sqoop Import by Specifying Columns
  • Sqoop import Using Boundary Query
  • Sqoop import while filtering Unnecessary Data
  • Sqoop Import Using Split By to distribute import using non default column
  • Importing Query Results using Sqoop Import
  • Dealing with tables with Composite Keys while using Sqoop Import
  • Dealing with tables with Non Numeric Key Fields while using Sqoop Import
  • Dealing with tables with No Key Fields while using Sqoop Import
  • Using autoreset-to-one-mapper to use only one mapper while importing data using Sqoop from tables with no key fields
  • Default Delimiters used by Sqoop Import for Text File Format
  • Specifying Delimiters for Sqoop Import using Text File Format
  • Dealing with Null Values using Sqoop Import
  • Import Multiple Tables from source database using Sqoop Import

Importing data from MySQL to Hive Tables using Sqoop Import

  • Quick Overview of Hive
  • Create Hive Database for Sqoop Import
  • Create Empty Hive Table for Sqoop Import
  • Import Data into Hive Table from source database table using Sqoop Import
  • Managing Hive Tables while importing data using Sqoop Import using Overwrite
  • Managing Hive Tables while importing data using Sqoop Import – Errors Out If Table Already Exists
  • Understanding Execution Flow of Sqoop Import into Hive tables
  • Review Files generated by Sqoop Import in Hive Tables
  • Sqoop Delimiters vs Hive Delimiters
  • Different File Formats supported by Sqoop Import while importing into Hive Tables
  • Sqoop Import all Tables into Hive from source database

Exporting Data from HDFS/Hive to MySQL using Sqoop Export

  • Introduction to Sqoop Export
  • Prepare Data for Sqoop Export
  • Create Table in MySQL for Sqoop Export
  • Perform Simple Sqoop Export from HDFS to MySQL table
  • Understanding Execution Flow of Sqoop Export
  • Specifying Number of Mappers for Sqoop Export
  • Troubleshooting the Issues related to Sqoop Export
  • Merging or Upserting Data using Sqoop Export – Overview
  • Quick Overview of MySQL – Upsert using Sqoop Export
  • Update Data using Update Key using Sqoop Export
  • Merging Data using allowinsert in Sqoop Export
  • Specifying Columns using Sqoop Export
  • Specifying Delimiters using Sqoop Export
  • Using Stage Table for Sqoop Export

Submitting Sqoop Jobs and Incremental Sqoop Imports

  • Introduction to Sqoop Jobs
  • Adding Password File for Sqoop Jobs
  • Creating Sqoop Job
  • Run Sqoop Job
  • Overview of Incremental Loads using Sqoop
  • Incremental Sqoop Import – Using Where
  • Incremental Sqoop Import – Using Append Mode
  • Incremental Sqoop Import – Create Table
  • Incremental Sqoop Import – Create Sqoop Job
  • Incremental Sqoop Import – Execute Job
  • Incremental Sqoop Import – Add Additional Data
  • Incremental Sqoop Import – Rerun Job
  • Incremental Sqoop Import – Using Last Modified

Here are the objectives for this course.

Provide Structure to the Data

Use Data Definition Language (DDL) statements to create or alter structures in the metastore for use by Hive and Impala.

  • Create tables using a variety of data types, delimiters, and file formats
  • Create new tables using existing tables to define the schema
  • Improve query performance by creating partitioned tables in the metastore
  • Alter tables to modify the existing schema
  • Create views in order to simplify queries

Data Analysis

Use Query Language (QL) statements in Hive and Impala to analyze data on the cluster.

  • Prepare reports using SELECT commands including unions and subqueries
  • Calculate aggregate statistics, such as sums and averages, during a query
  • Create queries against multiple data sources by using join commands
  • Transform the output format of queries by using built-in functions
  • Perform queries across a group of rows using windowing functions

Exercises are provided so that you get enough practice to become proficient with Sqoop and with writing queries using Hive and Impala.

All the demos are given on our state-of-the-art Big Data cluster. If you do not have a multi-node cluster of your own, you can sign up for our labs and practice Sqoop and Hive on our multi-node cluster.

Language: English

Content

Introduction

CCA 159 Certification Exam – Overview
Tools for preparation
Getting Details about the Exam
Signing up for the Exam

Using Cloudera QuickStart VM

Download and Install Virtual Box
Setup Cloudera QuickStart VM
Overview of Cloudera QuickStart VM
Overview of MySQL Databases
Setup NYSE Database in MySQL
Overview of HDFS and Setup Datasets
Overview of Hive and Create External Table
Validate Sqoop

Using ITVersity labs

Signing up for the labs
Connecting to the gateway node of the cluster
Overview of HDFS in the cluster
Using Hive in the cluster
Understanding MySQL in the cluster
Running Sqoop Commands in the cluster

Overview of Big Data ecosystem

Overview of Distributions and Management Tools such as Ambari
Properties and Properties Files of Big Data Tools – General Guidelines
Hadoop Distributed File System – Quick Overview
Distributed Computing using YARN and Map Reduce 2 – Quick Overview
Submitting Map Reduce Job in YARN Framework
Determining Number of Mappers and Reducers
Understanding YARN and Map Reduce Configuration Properties
Reviewing and Overriding Map Reduce Job Run Time Properties
Reviewing Map Reduce Job Logs – using Resource Manager and Job History Server UI
Map Reduce Job Counters
Overview of Hive
Databases in Big Data and Query Engines
Overview of Data Ingestion Tools in Big Data

Overview of HDFS Commands

Introduction to HDFS for Certification Exams
Overview of HDFS and Properties Files
Overview of “hadoop fs” or “hdfs dfs” command
Listing Files in HDFS
User Spaces or Home Directories in HDFS
Creating Directory in HDFS
Copying Files and Directories into HDFS
File and Directory Permissions Overview
Getting Files and Directories from HDFS
Previewing Text Files in HDFS – cat and tail
Copying or Moving Files from one HDFS location to other HDFS location
Understanding Size of the File System and Data Sets – using df and du
Overview of Block Size and Replication Factor
Getting metadata of files using “hdfs fsck”
Resources and Exercises

Apache Hive – Getting Started

Overview of Hive Language Manual
Launching and Using Hive CLI
Overview of Hive Properties – SET and .hiverc
Hive CLI History and .hiverc
Running HDFS Commands using Hive CLI
Understanding Warehouse Directory
Creating Database in Hive and Switching to the Database
Creating First Table in Hive and list the tables
Retrieve metadata of Hive Tables using DESCRIBE (EXTENDED and FORMATTED)
Role of Hive Metastore
Overview of beeline – Alternative to Hive CLI
Running Hive Queries using Beeline

Apache Hive – Managing Tables in Hive

Create tables in Hive – orders
Overview of Data Types in Hive
Adding Comments to Columns and Tables
Loading Data into Hive Tables from Local File System
Loading Data into Hive Tables from HDFS Location
Loading Data into Hive Tables – Overwrite vs. Append
Creating External Tables in Hive
Specifying Location for Hive Tables
Managed Tables vs. External Tables
Default Delimiters in Hive Tables using Text File Format
Overview of File Formats – STORED AS Clause
Differences between Hive and RDBMS
Truncating and Dropping tables in Hive
Resources and Exercises

Apache Hive – Managing Tables in Hive – Partitioning and Bucketing

Introduction to Partitioning and Bucketing in Hive
Creating Tables using orc File Format – order_items
Inserting Data into order_items using stage table
Can we use LOAD Command to get data into order_items with orc file format?
Creating Partitioned Tables in Hive – orders_part with order_month as key
Adding Partitions to Tables in Hive
Loading into Partitions in Hive Tables
Inserting Data into Partitions in Hive Tables
Inserting data into Partitioned Tables – Using dynamic partition mode
Creating Bucketed Tables – orders_buck and order_items_buck
Inserting Data Into Bucketed Tables
Bucketing with Sorting
Overview of ACID Transactions in Hive
Create Tables for ACID Transactions
Inserting individual records into Hive Tables
Updating and Deleting data in Hive Bucketed Tables

Apache Hive – Overview of Functions

Overview of Functions
Validating Functions
String Manipulation – Case Conversion and Length
String Manipulation – substr and split
String Manipulation – trimming and padding Functions
String Manipulation – Reverse and Concatenating multiple strings
Date Manipulation – Getting Current Date and Timestamp
Date Manipulation – Date Arithmetic
Date Manipulation – trunc
Date Manipulation – Extracting information using date_format
Date Manipulation – Extracting information using year, month, day etc
Date Manipulation – Dealing with Unix Timestamp
Overview of Numeric Functions
Type Cast Functions for Data Type Conversion
Handling null values using nvl
Query Example – Get Word Count

Apache Hive – Writing Basic Queries

Overview of SQL
Hive Query – Execution Life Cycle
Reviewing Logs for Hive Queries
Projecting Data using SELECT and Overview of FROM Clause
Using CASE and WHEN as part of SELECT Clause
Projecting DISTINCT Values
Filtering Data using WHERE Clause
Boolean Operations such as OR and AND using multiple fields
Boolean OR vs. IN
Filtering data using LIKE Operator
Basic Aggregations using Aggregate Functions
Performing basic aggregations such as SUM, MIN, MAX etc using GROUP BY
Filtering post aggregation using HAVING
Global Sorting using ORDER BY
Overview of DISTRIBUTE BY
Sorting Data within groups using DISTRIBUTE BY and SORT BY
Overview of CLUSTER BY

Apache Hive – Writing Basic Queries – Joins and Set Operations

Overview of Nested Sub Queries
Nested Sub Queries in WHERE Clause with IN or NOT IN
Nested Sub Queries in WHERE Clause with EXISTS or NOT EXISTS
Overview of Joins
Joining Multiple Tables in Hive
Outer Joins in Hive
Full Outer Joins in Hive
Map Side Join vs. Reduce Side Join
Joining using Legacy Syntax
Cartesian Product between two data sets
Overview of SET Operations
Perform Union between two Data Sets
Not Supported – Perform Intersection or Minus between two Data Sets

Apache Hive – Analytics or Windowing Functions

Prepare HR Database with employees table
Overview of Analytics Functions or Windowing Functions
Performing Aggregations
Create tables to get daily revenue and daily product revenue
Getting Lead and Lag using Windowing Functions – order by
Getting Lead and Lag using Windowing Functions – order by and partition by
Windowing Functions – Using first_value and last_value
Applying rank Function
Applying dense_rank Function
Applying row_number Function
Difference between rank, dense_rank and row_number
Understanding Order of Execution
Quick recap of Nested Sub Queries
Filtering data using fields derived using analytics or windowing functions

Running Queries using Impala

Introduction to Impala
Role of Impala Daemons
Impala State Store and Catalog Server
Overview of impala-shell
Relationship between Hive and Impala
Overview of Creating Databases and Tables using Impala
Loading and Inserting Data into Impala Tables
Running Queries using Impala Shell
Reviewing Logs of Impala Queries
Synching Hive Metadata with Impala – using INVALIDATE METADATA
Running Scripts using Impala Shell

Apache Sqoop – Getting Started

Introduction to Sqoop
Validate Source Database – MySQL
Review JDBC Jar file to connect to MySQL
Getting help of Sqoop using Command Line
Overview of Sqoop User Guide
Validate Sqoop and MySQL integration using “sqoop list-databases”
List tables in MySQL using “sqoop list-tables”
Run Queries in MySQL using “sqoop eval”
Understanding Logs in Sqoop
Redirecting Sqoop Logs into files

Apache Sqoop – Importing Data into HDFS

Overview of Sqoop Import Command
Perform Sqoop Import of orders – --table and --target-dir
Perform Sqoop import of order_items – --warehouse-dir
Sqoop Import – Managing HDFS Directories – append or overwrite or fail
Sqoop Import – Execution Flow
Reviewing logs of Sqoop Import
Sqoop Import – Specifying Number of Mappers
Review the Output Files
Sqoop Import – Supported File Formats
Validating avro Files using avro-tools
Sqoop Import – Using Compression

Apache Sqoop – Importing Data into HDFS – Customizing

Sqoop Import – Customizing – Introduction
Sqoop Import – Specifying Columns
Sqoop Import – Using boundary query
Sqoop Import – Filter unnecessary data
Sqoop Import – Using Split By
Sqoop Import – Importing Query Results
Sqoop Import – Dealing with Composite Keys
Sqoop Import – Dealing with Primary Key or Split By using Non Numeric Field
Sqoop Import – Dealing with Tables without Primary Key
Sqoop Import – Autoreset to One Mapper
Sqoop Import – Default Delimiters using Text File Format
Sqoop Import – Specifying Delimiters – Import NYSE Data with tab (\t) as delimiter
Sqoop Import – Dealing with NULL Values
Sqoop Import – import-all-tables

Apache Sqoop – Importing Data into Hive Tables

Sqoop Import – Importing Data into Hive tables – Overview
Quick Overview of Hive
Sqoop Import – Create Hive Database
Creating empty Hive Table using create-hive-table
Sqoop Import – Import orders table to Hive Database
Sqoop Import – Managing Table using Hive Import – Overwrite
Sqoop Import – Managing Table using Hive Import – Error out – create-hive-table
Sqoop Import – Understanding Execution Flow while importing into Hive Table
Sqoop Import – Review files in Hive Tables
Sqoop Delimiters vs. Hive Delimiters – Text Files
Sqoop Import – Hive File Formats
Sqoop Import all tables – Hive

Apache Sqoop – Exporting Data from HDFS to RDBMS

Introduction
Prepare data for Export
Creating Table in MySQL
Sqoop Export – Perform Simple Export – --table and --export-dir
Sqoop Export – Execution Flow
Sqoop Export – Specifying Number of Mappers
Sqoop Export – Troubleshooting the issues
Sqoop Export – Merging or Upserting Overview
Sqoop Export – Quick Overview of MySQL for Upsert
Sqoop Export – Using --update-mode – updateonly (default)
Sqoop Export – Using --update-mode – allowinsert
Sqoop Export – Specifying Columns
Sqoop Export – Specifying Delimiters
Sqoop Export – Using Stage Table

Apache Sqoop – Incremental Imports and Jobs

Overview of Sqoop Jobs
Adding Password File
Creating Sqoop Job
Running Sqoop Job
Overview of Incremental Imports
Incremental Import – Using where
Incremental Import – Append Mode
Incremental Import – Create training_orders_incr in retail_export
Incremental Import – Create Job
Incremental Import – Execute Job
Incremental Import – Add additional data (order_id > 30000)
Incremental Import – Rerun the job and validate results
Incremental Import – Using mode lastmodified