Course Information
Course Name
DSCI-272: Predicting with Cloudera Machine Learning
Exam code
CDP-6001
Duration
4 Days
Certification
Cloudera Machine Learning Engineer
Overview
Enterprise data science teams need collaborative access to the business data, tools, and computing resources required to develop and deploy machine learning workflows. Cloudera Machine Learning (CML), part of the Cloudera Data Platform (CDP), provides that access.
This four-day course covers machine learning workflows and operations using CML. You will explore, visualize, and analyze data, then train, evaluate, and deploy machine learning models.
The course walks through an end-to-end data science and machine learning workflow based on realistic scenarios and datasets from a fictitious technology company. The demonstrations and exercises are conducted in Python (with PySpark) using CML.
Audience Profile
The course is designed for data scientists who need to understand how to utilize Cloudera Machine Learning and the Cloudera Data Platform to achieve faster model development and deliver production machine learning at scale. Data engineers, developers, and solution architects who collaborate with data scientists will also find this course valuable.
Prerequisites
At Course Completion
Through lecture and hands-on exercises, you will learn how to:
Utilize Cloudera SDX and other components of the Cloudera Data Platform to locate data for machine learning experiments
Use an Applied ML Prototype (AMP)
Manage machine learning experiments
Connect to various data sources and explore data
Utilize Apache Spark and Spark ML
Deploy an ML model as a REST API
Manage and monitor deployed ML models
Course Outline
Module 1: Introduction to CML
Overview
CML Versus CDSW
ML Workspaces
Workspace Roles
Projects and Teams
Settings
Runtimes/Legacy Engines
Module 2: Introduction to AMPs and the Workbench
Editors and IDE
Git
Embedded Web Applications
AMPs
Module 3: Data Access and Lineage
SDX Overview
Data Catalog
Authorization
Lineage
Module 4: Data Visualization in CML
Data Visualization Overview
CDP Data Visualization Concepts
Using Data Visualization in CML
Module 5: Experiments
Experiments in CML
Module 6: Introduction to the CML Native Workbench
Entering Code
Getting Help
Accessing the Linux Command Line
Working with Python Packages
Formatting Session Output
Module 7: Spark Overview
How Spark Works
The Spark Stack
File Formats in Spark
Spark Interface Languages
Introduction to PySpark
How DataFrame Operations Become Spark Jobs
How Spark Executes a Job
Module 8: Running a Spark Application
Running a Spark Application
Reading Data into a Spark SQL DataFrame
Examining the Schema of a DataFrame
Computing the Number of Rows and Columns of a DataFrame
Examining a Few Rows of a DataFrame
Stopping a Spark Application
Module 9: Inspecting a Spark DataFrame
Inspecting a DataFrame
Inspecting a DataFrame Column
Module 10: Transforming DataFrames
Spark SQL DataFrames
Working with Columns
Working with Rows
Working with Missing Values
Module 11: Transforming DataFrame Columns
Spark SQL Data Types
Working with Numerical Columns
Working with String Columns
Working with Date and Timestamp Columns
Working with Boolean Columns
Module 12: Complex Types
Complex Collection Data Types
Arrays
Maps
Structs
Module 13: User-Defined Functions
User-Defined Functions
Example 1: Hour of Day
Example 2: Great-Circle Distance
Module 14: Reading and Writing DataFrames
Working with Delimited Text Files
Working with Text Files
Working with Parquet Files
Working with Hive Tables
Working with Object Stores
Working with Pandas DataFrames
Module 15: Combining and Splitting DataFrames
Combining and Splitting DataFrames
Joining DataFrames
Splitting a DataFrame
Module 16: Summarizing and Grouping DataFrames
Summarizing Data with Aggregate Functions
Grouping Data
Pivoting Data
Module 17: Window Functions
Window Functions
Example: Cumulative Count and Sum
Example: Compute Average Days Between Rides for Each Rider
Module 18: Machine Learning Overview
Introduction to Machine Learning
Machine Learning Tools
Module 19: Apache Spark MLlib
Introduction to Apache Spark MLlib
Module 20: Exploring and Visualizing DataFrames
Possible Workflows for Big Data
Exploring a Single Variable
Exploring a Pair of Variables
Module 21: Monitoring, Tuning, and Configuring Spark Applications
Monitoring Spark Applications
Configuring the Spark Environment
Module 22: Fitting and Evaluating Regression Models
Assemble the Feature Vector
Fit the Linear Regression Model
Module 23: Fitting and Evaluating Classification Models
Generate Label
Fit the Logistic Regression Model
Module 24: Tuning Algorithm Hyperparameters Using Grid Search
Requirements for Hyperparameter Tuning
Tune the Hyperparameters Using Holdout Cross-Validation
Tune the Hyperparameters Using K-Fold Cross-Validation
Module 25: Fitting and Evaluating Clustering Models
Print and Plot the Home Coordinates
Fit a Gaussian Mixture Model
Explore the Cluster Profiles
Module 26: Processing Text: Fitting and Evaluating Topic Models
Fit a Topic Model Using Latent Dirichlet Allocation
Module 27: Fitting and Evaluating Recommender Models
Recommender Models
Generate Recommendations
Module 28: Working with Machine Learning Pipelines
Fit the Pipeline Model
Inspect the Pipeline Model
Module 29: Applying a Scikit-Learn Model to a Spark DataFrame
Build a Scikit-Learn Model
Apply the Model Using a Spark UDF
Module 30: Deploying a Machine Learning Model as a REST API in CML
Load the Serialized Model
Define a Wrapper Function to Generate a Prediction
Test the Function
Module 31: Autoscaling, Performance, and GPU Settings
Autoscaling Workloads
Working with GPUs
Module 32: Model Metrics and Monitoring
Why Monitor Models?
Common Model Metrics
Model Monitoring with Evidently
Continuous Model Monitoring
All Cloudera certification courses are conducted by certified trainers from Iverson.
Digital Methods acts as the official training partner and assists with program consultation, registration, coordination, scheduling, and administrative arrangements to ensure a seamless and well-managed training experience.