1. DANA-262: Analyzing with Cloudera Data Warehouse
Course Information
Course Name: DANA-262: Analyzing with Cloudera Data Warehouse
Exam Code: CDP-4001
Duration: 4 Days
Certification: Cloudera Data Analyst
Overview
This four-day Analyzing with Cloudera Data Warehouse course teaches you to apply traditional data analytics and business intelligence skills to big data. It presents the tools data professionals need to access, manipulate, transform, and analyze complex data sets using SQL and familiar scripting languages.
Audience Profile
This course is designed for data analysts, business intelligence specialists, developers, system architects, and database administrators.
Prerequisites
Participants should have some knowledge of SQL and basic familiarity with the Linux command line.
At Course Completion
Through instructor-led discussion and interactive, hands-on exercises, participants will navigate the big data ecosystem, learning how to:
Use Apache Hive and Apache Impala to access data through queries
Identify distinctions between Hive and Impala, such as differences in syntax, data formats, and supported features
Write and execute queries that use functions, aggregate functions, and subqueries
Use joins and unions to combine datasets
Create, modify, and delete tables, views, and databases
Load data into tables and store query results
Select file formats and develop partitioning schemes for better performance
Use analytic and windowing functions to gain insight into their data
Store and query complex or nested data structures
Process and analyze semi-structured and unstructured data
Optimize and extend the capabilities of Hive and Impala
Determine whether Hive, Impala, an RDBMS, or a mix of these is the best choice for a given task
Utilize the benefits of CDP Public Cloud Data Warehouse
Course Outline
Module 1: Foundations for Big Data Analytics
Big Data Analytics Overview
Data Storage: HDFS
Distributed Data Processing: YARN, MapReduce, and Spark
Data Processing and Analysis: Hive and Impala
Database Integration: Sqoop
Other Data Tools
Exercise Scenario Explanation
Module 2: Introduction to Apache Hive and Impala
What Is Hive?
What Is Impala?
Why Use Hive and Impala?
Schema and Data Storage
Comparing Hive and Impala to Traditional Databases
Use Cases
Module 3: Querying with Apache Hive and Impala
Databases and Tables
Basic Hive and Impala Query Language Syntax
Data Types
Using Hue to Execute Queries
Using Beeline (Hive’s Shell)
Using the Impala Shell
Module 4: Common Operators and Built-In Functions
Operators
Scalar Functions
Aggregate Functions
Module 5: Data Management
Data Storage
Creating Databases and Tables
Loading Data
Altering Databases and Tables
Simplifying Queries with Views
Storing Query Results
Module 6: Data Storage and Performance
Partitioning Tables
Loading Data into Partitioned Tables
When to Use Partitioning
Choosing a File Format
Using Avro and Parquet File Formats
Module 7: Working with Multiple Datasets
UNION and Joins
Handling NULL Values in Joins
Advanced Joins
Module 8: Analytic Functions and Windowing
Using Common Analytic Functions
Other Analytic Functions
Sliding Windows
Module 9: Complex Data
Complex Data with Hive
Complex Data with Impala
Module 10: Analyzing Text
Using Regular Expressions with Hive and Impala
Processing Text Data with SerDes in Hive
Sentiment Analysis and n-grams
Module 11: Apache Hive Optimization
Understanding Query Performance
Bucketing
Hive on Spark
Module 12: Apache Impala Optimization
How Impala Executes Queries
Improving Impala Performance
Module 13: Extending Apache Hive and Impala
Custom SerDes and File Formats in Hive
Data Transformation with Custom Scripts in Hive
User-Defined Functions
Parameterized Queries
Module 14: Choosing the Best Tool for the Job
Comparing Hive, Impala, and Relational Databases
Which to Choose?
Module 15: Conclusion
All Cloudera certification courses are conducted by certified trainers from Iverson.
Digital Methods acts as the official training partner and assists with program consultation, registration, coordination, scheduling, and administrative arrangements to ensure a seamless and well-managed training experience.