PySpark & AWS: Master Big Data With PySpark and AWS

Learn how to use Spark, PySpark, and AWS, build Spark applications, and master the Spark and Hadoop ecosystems

Course Description

The hottest buzzwords in the Big Data analytics industry are Python and Apache Spark, and PySpark brings the two together by exposing Spark through a Python API. In this course, you’ll start right from the basics and proceed to the advanced levels of data analysis. From cleaning data to building features and implementing machine learning (ML) models, you’ll learn how to execute end-to-end workflows using PySpark.

Throughout the course, you’ll use PySpark to perform data analysis. You’ll explore Spark RDDs, DataFrames, and a bit of Spark SQL, along with the transformations and actions that can be performed on data through RDDs and DataFrames. You’ll also explore the Spark and Hadoop ecosystems and their underlying architectures, and you’ll use (and get to know) the Databricks environment for running your Spark scripts.

Finally, you’ll get a taste of Spark on the AWS cloud. You’ll see how to leverage AWS storage, database, and compute services, and how Spark communicates with different AWS services to fetch the data it needs.

How Is This Course Different?

In this Learning by Doing course, every theoretical explanation is followed by practical implementation.

The course ‘PySpark & AWS: Master Big Data With PySpark and AWS’ is crafted to reflect the most in-demand workplace skills. This course will help you understand all the essential concepts and methodologies with regard to PySpark. The course is:

  • Easy to understand.
  • Expressive.
  • Exhaustive.
  • Practical with live coding.
  • Rich with state-of-the-art and latest knowledge of this field.

As this course is a detailed compilation of all the basics, it will motivate you to make quick progress and go beyond what you have already learned. At the end of each concept, you will be assigned homework, tasks, activities, and quizzes along with solutions, to evaluate and reinforce your learning of the preceding concepts and methods. Most of these activities are coding-based, as the aim is to get you up and running with implementations.

High-quality video content, in-depth course material, evaluating questions, detailed course notes, and informative handouts are some of the perks of this course. You can approach our friendly team in case of any course-related queries, and we assure you of a fast response.

The course tutorials are divided into 140+ brief videos. You’ll learn the concepts and methodologies of PySpark and AWS along with a lot of practical implementation. The total runtime of the HD videos is around 16 hours.

Why Should You Learn PySpark and AWS?

PySpark is the Python API for Apache Spark, and it is what makes the magic happen.

PySpark is worth learning because of the huge demand for Spark professionals and the high salaries they command. The usage of PySpark in Big Data processing is increasing at a rapid pace compared to other Big Data tools.

AWS, launched in 2006, is one of the fastest-growing public clouds. The right time to cash in on cloud computing skills, AWS skills in particular, is now.

Course Content:

The all-inclusive course consists of the following topics:

1. Introduction:

a. Why Big Data?

b. Applications of PySpark

c. Introduction to the Instructor

d. Introduction to the Course

e. Projects Overview

2. Introduction to Hadoop, Spark EcoSystems, and Architectures:

a. Hadoop EcoSystem

b. Spark EcoSystem

c. Hadoop Architecture

d. Spark Architecture

e. PySpark Databricks setup

f. PySpark local setup

3. Spark RDDs:

a. Introduction to PySpark RDDs

b. Understanding underlying Partitions

c. RDD transformations

d. RDD actions

e. Creating Spark RDD

f. Running Spark Code Locally

g. RDD Map (Lambda)

h. RDD Map (Simple Function)

i. RDD FlatMap

j. RDD Filter

k. RDD Distinct

l. RDD GroupByKey

m. RDD ReduceByKey

n. RDD (Count and CountByValue)

o. RDD (saveAsTextFile)

p. RDD (Partition)

q. Finding Average

r. Finding Min and Max

s. Mini project on student data set analysis

t. Total Marks by Male and Female Student

u. Total Passed and Failed Students

v. Total Enrollments per Course

w. Total Marks per Course

x. Average marks per Course

y. Finding Minimum and Maximum marks

z. Average Age of Male and Female Students
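
As a taste of the RDD workflow listed above, here is a minimal, Spark-free sketch of the map / reduceByKey / mapValues pattern used for tasks like "average marks per course". The records and course names are invented for illustration, and the comments show the corresponding PySpark calls; with a real SparkContext, the same logic runs distributed over partitions.

```python
# Pure-Python analogue of the RDD average-marks-per-course pattern.
# Sample (course, marks) records are invented for illustration.
records = [("DB", 60), ("DB", 80), ("OOP", 70), ("OOP", 90), ("DB", 100)]

# rdd.map(lambda r: (r[0], (r[1], 1)))  -> pair each mark with a count of 1
pairs = [(course, (marks, 1)) for course, marks in records]

# rdd.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  -> sums per key
totals = {}
for course, (marks, count) in pairs:
    s, c = totals.get(course, (0, 0))
    totals[course] = (s + marks, c + count)

# rdd.mapValues(lambda t: t[0] / t[1])  -> sum / count = average per course
averages = {course: s / c for course, (s, c) in totals.items()}
print(averages)  # {'DB': 80.0, 'OOP': 80.0}
```

The same sum-and-count trick underpins the min/max and "total marks per course" activities: only the pairing and the reduce function change.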

4. Spark DFs:

a. Introduction to PySpark DFs

b. Understanding underlying RDDs

c. DFs transformations

d. DFs actions

e. Creating Spark DFs

f. Spark Infer Schema

g. Spark Provide Schema

h. Create DF from RDD

i. Select DF Columns

j. Spark DF with Column

k. Spark DF with Column Renamed and Alias

l. Spark DF Filter rows

m. Spark DF (Count, Distinct, Duplicate)

n. Spark DF (sort, order By)

o. Spark DF (Group By)

p. Spark DF (UDFs)

q. Spark DF (DF to RDD)

r. Spark DF (Spark SQL)

s. Spark DF (Write DF)

t. Mini project on Employees' data set analysis

u. Project Overview

v. Project (Count and Select)

w. Project (Group By)

x. Project (Group By, Aggregations, and Order By)

y. Project (Filtering)

z. Project (UDF and With Column)

aa. Project (Write)
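
To give a flavor of the DataFrame operations above, here is a hedged pure-Python analogue of select, filter, and groupBy-with-aggregation on an invented employees dataset (the column names and rows are illustrative only). The comments show the PySpark equivalents covered in this section.

```python
# Pure-Python analogue of common Spark DataFrame transformations.
# The employees rows and columns below are made up for illustration.
employees = [
    {"name": "Amy",  "dept": "IT", "salary": 5000},
    {"name": "Bob",  "dept": "HR", "salary": 4000},
    {"name": "Cara", "dept": "IT", "salary": 7000},
]

# df.filter(df.salary > 4500)  -> keep only well-paid rows
high_paid = [row for row in employees if row["salary"] > 4500]

# df.select("name")  -> project a single column
names = [row["name"] for row in high_paid]

# df.groupBy("dept").agg(avg("salary"))  -> average salary per department
by_dept = {}
for row in employees:
    by_dept.setdefault(row["dept"], []).append(row["salary"])
avg_salary = {dept: sum(v) / len(v) for dept, v in by_dept.items()}

print(names)       # ['Amy', 'Cara']
print(avg_salary)  # {'IT': 6000.0, 'HR': 4000.0}
```

In PySpark the same chain stays lazy until an action (such as `show` or `count`) triggers execution, which is the transformations-versus-actions distinction the section covers.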

5. Collaborative filtering:

a. Understanding collaborative filtering

b. Developing a recommendation system using the ALS model

c. Utility Matrix

d. Explicit and Implicit Ratings

e. Expected Results

f. Dataset

g. Joining Dataframes

h. Train and Test Data

i. ALS model

j. Hyperparameter tuning and cross-validation

k. Best model and evaluate predictions

l. Recommendations
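
The core idea behind the ALS model in this section is that each user and each item gets a small latent-factor vector, and a predicted rating is their dot product. The toy factors below are invented for illustration; in PySpark, `pyspark.ml.recommendation.ALS` learns such factors from the explicit or implicit ratings in the utility matrix.

```python
# Toy latent factors (invented values) illustrating ALS-style prediction:
# predicted rating = dot(user_vector, item_vector).
user_factors = {"u1": [0.9, 0.1], "u2": [0.2, 0.8]}
item_factors = {"action_movie": [1.0, 0.0], "romance_movie": [0.0, 1.0]}

def predict(user, item):
    """Predicted rating as the dot product of the two factor vectors."""
    return sum(u * i for u, i in zip(user_factors[user], item_factors[item]))

print(predict("u1", "action_movie"))   # 0.9 -> u1 leans toward action
print(predict("u2", "romance_movie"))  # 0.8 -> u2 leans toward romance
```

ALS alternates between fixing the user factors and solving for the item factors (and vice versa) until the predicted ratings fit the observed ones; hyperparameter tuning then picks the rank and regularization.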

6. Spark Streaming:

a. Understanding the difference between batch and streaming analysis

b. Hands-on with Spark Streaming through a word count example

c. Spark Streaming with RDD

d. Spark Streaming Context

e. Spark Streaming Reading Data

f. Spark Streaming Cluster Restart

g. Spark Streaming RDD Transformations

h. Spark Streaming DF

i. Spark Streaming Display

j. Spark Streaming DF Aggregations
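
The batch-versus-streaming distinction above boils down to this: instead of one pass over a complete dataset, data arrives in micro-batches and state (such as word counts) is updated per batch. Here is a minimal, Spark-free sketch of that idea with invented batch contents; Spark Streaming's stateful word count follows the same shape over a live source.

```python
# Spark-free sketch of streaming word count: each iteration of the loop
# plays the role of one micro-batch interval, and `running` is the state
# carried across batches. Batch contents are invented for illustration.
from collections import Counter

batches = [["spark", "rides"], ["spark", "streams", "spark"]]

running = Counter()
for batch in batches:      # one micro-batch arrives...
    running.update(batch)  # ...and the running counts are updated
    print(dict(running))
```

In real Spark Streaming the batches come from a streaming context reading a socket, files, or a message queue, and the same aggregation can be expressed over streaming DataFrames.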

7. ETL Pipeline:

a. Understanding the ETL

b. ETL pipeline Flow

c. Data set

d. Extracting Data

e. Transforming Data

f. Loading Data (Creating RDS)

g. RDS Networking

h. Downloading Postgres

i. Installing Postgres

j. Connecting to RDS through pgAdmin

k. Loading Data
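
The extract-transform-load flow in this section can be sketched compactly. The snippet below uses the stdlib `sqlite3` module purely as a stand-in for the Postgres/RDS target used in the course, and the CSV content is invented; the three stages map onto reading raw data, deriving new columns, and writing to a relational table.

```python
# Compact ETL sketch; sqlite3 stands in for the course's Postgres/RDS
# target, and the CSV data is invented for illustration.
import csv
import io
import sqlite3

# Extract: read raw CSV (here from an in-memory string instead of S3)
raw = "name,marks\nAmy,80\nBob,55\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: derive a pass/fail status column from the marks
for r in rows:
    r["status"] = "pass" if int(r["marks"]) >= 60 else "fail"

# Load: write the transformed rows into a relational table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students (name TEXT, marks INT, status TEXT)")
con.executemany("INSERT INTO students VALUES (?, ?, ?)",
                [(r["name"], r["marks"], r["status"]) for r in rows])
result = con.execute("SELECT name, status FROM students").fetchall()
print(result)  # [('Amy', 'pass'), ('Bob', 'fail')]
```

In the course itself, the extract and transform steps run in PySpark and the load step writes to an RDS Postgres instance reached over the network, but the shape of the pipeline is the same.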

8. Project – Change Data Capture / Replication Ongoing:

a. Introduction to Project

b. Project Architecture

c. Creating an RDS MySQL Instance

d. Creating an S3 Bucket

e. Creating DMS Source Endpoint

f. Creating DMS Destination Endpoint

g. Creating a DMS Instance

h. MySQL Workbench

i. Connecting with RDS and Dumping Data

j. Querying RDS

k. DMS Full Load

l. DMS Replication Ongoing

m. Stopping Instances

n. Glue Job (Full Load)

o. Glue Job (Change Capture)

p. Glue Job (CDC)

q. Creating Lambda Function and Adding Trigger

r. Checking Trigger

s. Getting S3 file name in Lambda

t. Creating Glue Job

u. Adding Invoke for Glue Job

v. Testing Invoke

w. Writing Glue Shell Job

x. Full Load Pipeline

y. Change Data Capture Pipeline
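
The change-data-capture idea behind this project can be sketched in a few lines: DMS-style change records (operation, key, value) are replayed against a target "table" to keep it in sync with the source. The record format below is deliberately simplified and invented; real DMS output carries full rows and metadata, and in the project the replay logic lives in a Glue job triggered by new S3 files.

```python
# Hedged CDC sketch: replay simplified (operation, key, value) change
# records against a target table, as a Glue CDC job conceptually does.
target = {}  # key -> row, standing in for the replicated table

changes = [
    ("I", 1, "Amy"),    # insert row 1
    ("I", 2, "Bob"),    # insert row 2
    ("U", 2, "Bobby"),  # update row 2
    ("D", 1, None),     # delete row 1
]

for op, key, value in changes:
    if op in ("I", "U"):
        target[key] = value    # upsert the changed row
    elif op == "D":
        target.pop(key, None)  # drop deleted rows

print(target)  # {2: 'Bobby'}
```

The full-load pipeline is the degenerate case of this loop: every record is an insert into an initially empty target.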

After the successful completion of this course, you will be able to:

  • Relate the concepts and practicals of Spark and AWS with real-world problems.
  • Implement any project that requires PySpark knowledge from scratch.
  • Know the theory and practical aspects of PySpark and AWS.

Who this course is for:

  • People who are beginners and know absolutely nothing about PySpark and AWS.
  • People who want to develop intelligent solutions.
  • People who want to learn PySpark and AWS.
  • People who love to learn theoretical concepts first before implementing them using Python.
  • People who want to learn PySpark along with its implementation in realistic projects.
  • Big Data Scientists.
  • Big Data Engineers.

What You’ll Learn:

  • The introduction and importance of Big Data.

  • Practical explanation and live coding with PySpark.

  • Spark applications

  • Spark EcoSystem

  • Spark Architecture

  • Hadoop EcoSystem

  • Hadoop Architecture

  • PySpark RDDs

  • PySpark RDD transformations

  • PySpark RDD actions

  • PySpark DataFrames

  • PySpark DataFrames transformations

  • PySpark DataFrames actions

  • Collaborative filtering in PySpark

  • Spark Streaming

  • ETL Pipeline

  • CDC and ongoing replication


Requirements:

  • Prior knowledge of Python.
  • An elementary understanding of programming.
  • A willingness to learn and practice.




This Course Includes
  • 19 hours of video
  • 196 Lectures
  • 10 Resources
  • Completion Certificate
  • Lifetime Access
  • Language: English
  • 30-Day Money-Back Guarantee

Sample Certificate

Use your certification to make a career change or to advance in your current career. Salaries are among the highest in the world.

We have 30 Million registered users and counting who have advanced their careers with us.

