Course Information
Course Name
DENG-254: Preparing with Cloudera Data Engineering
Exam code
CDP-3002
Duration
4 Days
Certification
Cloudera Data Engineer
Overview
This four-day hands-on training course delivers the key concepts and knowledge developers need to use Apache Spark to develop high-performance, parallel applications on the Cloudera Data Platform (CDP).
Hands-on exercises allow students to practice writing Spark applications that integrate with CDP core components. Participants will learn how to use Spark SQL to query structured data, how to use Hive features to ingest and denormalize data, and how to work with “big data” stored in a distributed file system.
After taking this course, participants will be prepared to tackle real-world challenges and build applications that support faster decisions, better decisions, and interactive analysis across a wide variety of use cases, architectures, and industries.
Audience Profile
This course is designed for developers and data engineers.
Prerequisites
Students are expected to have basic Linux experience and basic proficiency in either the Python or Scala programming language. Basic knowledge of SQL is helpful. Prior knowledge of Spark and Hadoop is not required.
At Course Completion
During this course, you will learn how to:
Distribute, store, and process data in a CDP cluster
Write, configure, and deploy Apache Spark applications
Use the Spark interpreters and Spark applications to explore, process, and analyze distributed data
Query data using Spark SQL, DataFrames, and Hive tables
Deploy a Spark application on the Data Engineering Service
Module 1: HDFS Introduction
· HDFS Overview
· HDFS Components and Interactions
· Additional HDFS Interactions
· Ozone Overview
· Exercise: Working with HDFS
Module 2: YARN Introduction
· YARN Overview
· YARN Components and Interaction
· Working with YARN
· Exercise: Working with YARN
Module 3: Working with RDDs
· Resilient Distributed Datasets (RDDs)
· Exercise: Working with RDDs
Module 4: Working with DataFrames
· Introduction to DataFrames
· Exercise: Introducing DataFrames
· Exercise: Reading and Writing DataFrames
· Exercise: Working with Columns
· Exercise: Working with Complex Types
· Exercise: Combining and Splitting DataFrames
· Exercise: Summarizing and Grouping DataFrames
· Exercise: Working with UDFs
· Exercise: Working with Windows
Module 5: Introduction to Apache Hive
· About Hive
· Transforming Data with HiveQL
Module 6: Working with Apache Hive
· Exercise: Working with Partitions
· Exercise: Working with Buckets
· Exercise: Working with Skew
· Exercise: Using SerDes to Ingest Text Data
· Exercise: Using Complex Types to Denormalize Data
Module 7: Hive and Spark Integration
· Hive and Spark Integration
· Exercise: Spark Integration with Hive
Module 8: Distributed Processing Challenges
· Shuffle
· Skew
· Order
Module 9: Spark Distributed Processing
· Spark Distributed Processing
· Exercise: Explore Query Execution Order
Module 10: Spark Distributed Persistence
· DataFrame and Dataset Persistence
· Persistence Storage Levels
· Viewing Persisted RDDs
· Exercise: Persisting DataFrames
Module 11: Data Engineering Service
· Create and Trigger Ad-Hoc Spark Jobs
· Orchestrate a Set of Jobs Using Airflow
· Data Lineage Using Atlas
· Auto-Scaling in Data Engineering Service
Module 12: Workload XM
· Optimize Workloads, Performance, Capacity
· Identify Suboptimal Spark Jobs
All Cloudera certification courses are conducted by certified trainers from Iverson.
Digital Methods acts as the official training partner and assists with program consultation, registration, coordination, scheduling, and administrative arrangements to ensure a seamless and well-managed training experience.