Learn HDFS commands, Hadoop, Spark SQL, SQL Queries, ETL & Data Analysis | Spark Hadoop Cluster VM | Fully Solved Qs

What you will learn

Students will get hands-on experience working in a Spark Hadoop environment that’s free and downloadable as part of this course.

Students will have opportunities to solve data engineering and data analysis problems using Spark on a Hadoop cluster in the sandbox environment that comes as part of this course.

Issuing HDFS commands.

Converting a set of data values in a given format stored in HDFS into new data values or a new data format and writing them into HDFS.

Loading data from HDFS for use in Spark applications & writing the results back into HDFS using Spark.

Reading and writing files in a variety of file formats.

Performing standard extract, transform, load (ETL) processes on data using the Spark API (see the sketch after this list).

Using metastore tables as an input source or an output sink for Spark applications.

Applying an understanding of the fundamentals of querying datasets in Spark.

Filtering data using Spark.

Writing queries that calculate aggregate statistics.

Joining disparate datasets using Spark.

Producing ranked or sorted data.
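
As a flavor of these tasks, here is a minimal sketch in Scala as it might look in the spark-shell (where the spark session is predefined); the paths and column names are hypothetical:

    // Extract: read a CSV file already stored in HDFS.
    val orders = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/orders.csv")            // hypothetical path

    // Transform: fix a column's type and keep only completed orders.
    import org.apache.spark.sql.functions.col
    val cleaned = orders
      .withColumn("amount", col("amount").cast("double"))
      .filter(col("status") === "COMPLETED")

    // Load: write the result back into HDFS in a new format (Parquet).
    cleaned.write.mode("overwrite").parquet("hdfs:///output/orders_parquet")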

Description

Apache Spark is currently one of the most popular systems for processing big data.

Apache Hadoop continues to be used by many organizations that want to store data locally, on premises. Hadoop allows these organizations to efficiently store big datasets ranging in size from gigabytes to petabytes.

As the number of vacancies for data science, big data analysis and data engineering roles continues to grow, so too does the demand for individuals with knowledge of Spark and Hadoop to fill them.

This course has been designed specifically for data scientists, big data analysts and data engineers looking to leverage the power of Hadoop and Apache Spark to make sense of big data.

This course will help individuals who are looking to interactively analyze big data, or to begin writing production applications that prepare data for further analysis, using Spark SQL in a Hadoop environment.

The course is also well suited to university students and recent graduates who are keen to gain exposure to Spark & Hadoop, and to anyone who simply wants to apply their SQL skills in a big data environment using Spark SQL.

This course has been designed to be concise, providing students with just enough theory to use Hadoop & Spark without getting bogged down in older low-level APIs such as RDDs.


By solving the questions in this course, students will develop the skills and the confidence needed to handle the real-world scenarios that come their way in a production environment.

(a) There are just under 30 problems in this course, covering HDFS commands, basic data engineering tasks and data analysis.

(b) Fully worked-out solutions to all the problems.

(c) Also included is the Verulam Blue virtual machine, an environment with a Spark Hadoop cluster already installed, so that you can practice working on the problems.

  • The VM contains a Spark Hadoop environment which allows students to read and write data to & from the Hadoop file system as well as to store metastore tables on the Hive metastore.
  • All the datasets students will need for the problems are already loaded onto HDFS, so there is no need for students to do any extra work.
  • The VM also has Apache Zeppelin installed, a web-based notebook with built-in Spark support, similar to Python’s Jupyter notebook.

This course will allow students to get hands-on experience working in a Spark Hadoop environment as they practice:

  • Converting a set of data values in a given format stored in HDFS into new data values or a new data format and writing them into HDFS.
  • Loading data from HDFS for use in Spark applications & writing the results back into HDFS using Spark.
  • Reading and writing files in a variety of file formats.
  • Performing standard extract, transform, load (ETL) processes on data using the Spark API.
  • Using metastore tables as an input source or an output sink for Spark applications.
  • Applying an understanding of the fundamentals of querying datasets in Spark.
  • Filtering data using Spark.
  • Writing queries that calculate aggregate statistics.
  • Joining disparate datasets using Spark.
  • Producing ranked or sorted data (see the Spark SQL sketch after this list).
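
For the query-oriented items, here is a hedged sketch using Spark SQL in the spark-shell, assuming a metastore table named sales with columns region, product and amount:

    // Filtering plus aggregate statistics per group, read from a metastore table.
    val totals = spark.sql("""
      SELECT region, SUM(amount) AS total_amount, COUNT(*) AS order_count
      FROM sales
      WHERE amount > 0
      GROUP BY region
    """)

    // Ranked output with a window function: the top product per region.
    val ranked = spark.sql("""
      SELECT region, product, revenue,
             RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rnk
      FROM (SELECT region, product, SUM(amount) AS revenue
            FROM sales
            GROUP BY region, product) t
    """)
    ranked.filter("rnk = 1").show()
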
Language: English

Content

Introduction

The Udemy Environment

Introduction to Hadoop & Spark

Section Introduction
Big Data
Distributed Storage & Processing
Introduction to Hadoop
Introduction to Spark
Spark Applications
Spark’s Interactive Shell
Distributed Processing on a Hadoop Cluster using Spark

Our Working Environment

Section Introduction
Install Oracle VM VirtualBox
The Verulam Blue VM – Zipped Files for Downloading
Loading the Verulam Blue VM
Booting up the VM
Spin Up Cluster
spark-shell
Run Zeppelin Notebook
Problems & practice test questions

HDFS Basic File Management

Interacting with HDFS
The File System Shell (FS Shell)
Commands and operations -help
Commands and operations -ls
Commands and operations -find
Commands and operations -mkdir
Commands and operations -put
Commands and operations -cp -mv
Commands and operations -cat -tail -text
Commands and operations -rmdir -rm
Commands and operations -get
Health warning
HDFS Basic File Management – Problems & Solutions
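
In practice, the commands named in these lecture titles are issued through the FS Shell; an illustrative session (the paths are hypothetical):

    hdfs dfs -help ls                         # usage notes for a single command
    hdfs dfs -mkdir -p /user/demo/input       # create a directory tree in HDFS
    hdfs dfs -put data.csv /user/demo/input   # copy a local file into HDFS
    hdfs dfs -ls /user/demo/input             # list the directory
    hdfs dfs -cat /user/demo/input/data.csv   # print a file's contents
    hdfs dfs -tail /user/demo/input/data.csv  # show the last kilobyte of a file
    hdfs dfs -get /user/demo/input/data.csv . # copy a file back to local disk
    hdfs dfs -rm /user/demo/input/data.csv    # delete a file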

Data Structures

Section Introduction
DataFrames
Tables
Temp Views
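
A rough illustration of the three structures listed above, in spark-shell Scala with made-up data:

    import spark.implicits._

    // DataFrame: a distributed, in-memory table of rows.
    val people = Seq(("alice", 34), ("bob", 28)).toDF("name", "age")

    // Table: persisted via the Hive metastore, so it outlives the session.
    people.write.mode("overwrite").saveAsTable("people")

    // Temp view: a session-scoped name that SQL can query.
    people.createOrReplaceTempView("people_view")
    spark.sql("SELECT name FROM people_view WHERE age > 30").show()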

Spark SQL & Creating Data Structures

Section Introduction
Querying Data Structures using SQL via Spark SQL
Creating DataFrames with Spark SQL
Creating Databases & Tables with Spark SQL
Creating Temporary Views with Spark SQL

Basic Operations on Data Structures

Section Introduction
Operations on DataFrame columns
Operations on DataFrame rows
Basic SQL queries for Tables
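
By way of example (hypothetical orders table and columns), the operations these lectures cover look roughly like this in the spark-shell:

    import org.apache.spark.sql.functions.col

    val orders = spark.table("orders")

    // Column operations: derive a new column, rename another.
    val priced = orders
      .withColumn("price_with_tax", col("price") * 1.2)
      .withColumnRenamed("qty", "quantity")

    // Row operations: filter and de-duplicate.
    val valid = priced.filter(col("quantity") > 0).dropDuplicates("order_id")

    // A basic SQL query against a table.
    spark.sql("SELECT order_id, price FROM orders WHERE qty > 0").show()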

Data Engineering

Section Introduction
The ETL Process
The Extract Phase of an ETL process
The Extract Phase – Loading CSV and Text files
The Extract Phase – Loading JSON and Parquet files
The Extract Phase – Loading Avro and ORC files
The Transform Phase of an ETL process
The Transform Phase – String Transformations
The Transform Phase – Numerical Transformations
The Transform Phase – Date & Time Transformations
The Transform Phase – Data Type Transformations
The Transform Phase – Transformations of Nulls
The Load Phase of an ETL process
The Load Phase – Saving DataFrame data to Files I
The Load Phase – Saving DataFrame data to Files II
The Load Phase – Saving DataFrame data to Tables
Data Engineering – Solutions to Problems
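
As a sketch of the three phases (hypothetical paths, columns and table name; reading Avro typically also requires the external spark-avro package):

    import org.apache.spark.sql.functions._

    // Extract: built-in readers exist for CSV, text, JSON, Parquet and ORC.
    val raw = spark.read.json("hdfs:///raw/events.json")

    // Transform: string, date & time, data type and null handling.
    val shaped = raw
      .withColumn("country", upper(col("country")))
      .withColumn("event_date", to_date(col("event_ts")))
      .withColumn("amount", col("amount").cast("double"))
      .na.fill(0.0, Seq("amount"))

    // Load: save to files or to a metastore table.
    shaped.write.mode("overwrite").orc("hdfs:///curated/events_orc")
    shaped.write.mode("overwrite").saveAsTable("curated_events")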

Data Analysis

Section Introduction
Metastore Tables as Input Sources or Output Sinks
Querying datasets in Spark
Math Functions in SQL
Filtering
Sorting & Ranking
Aggregation
Grouping
Multi Table Queries
Multi Table Queries – Joins
Multi Table Queries – Types of Joins
Multi Table Queries – Unions
Data Analysis – Solutions to Problems
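
For the multi-table lectures, a hedged Spark SQL sketch assuming hypothetical orders, customers and suppliers tables, where the first two share a customer_id column:

    // An inner join; LEFT, RIGHT and FULL OUTER joins use the same shape.
    val joined = spark.sql("""
      SELECT o.order_id, c.name, o.amount
      FROM orders o
      JOIN customers c ON o.customer_id = c.customer_id
    """)

    // A union stacks compatible result sets row-wise (UNION ALL keeps duplicates).
    val combined = spark.sql("""
      SELECT name FROM customers
      UNION ALL
      SELECT name FROM suppliers
    """)
    combined.show()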

End of Course Test Solutions

End of Course Test Solutions

Appendix – Hadoop Theory

HDFS Architecture
YARN Architecture

Appendix – Spark Theory

Components of a Spark application
The Driver Process
The Executor Process
The Master Process
The Spark Application Execution Model
Deploying Spark Applications on Hadoop clusters