
Master Big Data With PySpark and AWS

AISciences

Rating: 4.2

Learn how to use Spark, PySpark, AWS, Spark applications, the Spark ecosystem, and Hadoop while mastering PySpark

Updated on Apr 2024

Language: English

Category: Data Science, PySpark, Development

Lectures: 196

Resources: 10

Duration: 19 hours

30-day Money-Back Guarantee


Course Description


Python and Apache Spark are the hottest buzzwords in the Big Data analytics industry, and PySpark brings the two together. In this course, you’ll start from the basics and proceed to advanced levels of data analysis. From cleaning data to building features and implementing machine learning (ML) models, you’ll learn how to execute end-to-end workflows using PySpark.

Throughout the course, you’ll use PySpark to perform data analysis. You’ll explore Spark RDDs, DataFrames, and a bit of Spark SQL, along with the transformations and actions that can be applied to data through RDDs and DataFrames. You’ll also explore the Spark and Hadoop ecosystems and their underlying architectures, and you’ll use and explore the Databricks environment for running Spark scripts.

Finally, you’ll get a taste of Spark on the AWS cloud. You’ll see how to leverage AWS storage, database, and compute services, and how Spark can communicate with different AWS services to get the data it requires.
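
To give a flavor of what this looks like in practice, here is a minimal sketch of reading a file from S3 into a Spark DataFrame. The bucket and file names are hypothetical, and the cluster is assumed to have the S3 connector and AWS credentials configured:

    from pyspark.sql import SparkSession

    # On Databricks or EMR a SparkSession named `spark` already exists;
    # locally we build one ourselves.
    spark = SparkSession.builder.appName("pyspark-aws-taste").getOrCreate()

    # Read a CSV file straight from S3 (hypothetical bucket and key).
    df = spark.read.csv("s3a://my-example-bucket/data/students.csv",
                        header=True, inferSchema=True)
    df.show(5)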

How Is This Course Different?

In this Learning by Doing course, every theoretical explanation is followed by practical implementation.

The course ‘PySpark & AWS: Master Big Data With PySpark and AWS’ is crafted to reflect the most in-demand workplace skills. This course will help you understand all the essential concepts and methodologies with regard to PySpark. The course is:

  • Easy to understand.
  • Expressive.
  • Exhaustive.
  • Practical with live coding.
  • Rich with the latest, state-of-the-art knowledge of the field.

As this course is a detailed compilation of all the basics, it will motivate you to make quick progress and experience much more than what you have learned. At the end of each concept, you will be assigned homework, tasks, activities, and quizzes, along with solutions, to evaluate and reinforce the concepts and methods you have learned. Most of these activities are coding-based, as the aim is to get you up and running with implementations.

High-quality video content, in-depth course material, evaluating questions, detailed course notes, and informative handouts are some of the perks of this course. You can approach our friendly team in case of any course-related queries, and we assure you of a fast response.

The course tutorials are divided into 140+ brief videos. You’ll learn the concepts and methodologies of PySpark and AWS along with a lot of practical implementation. The total runtime of the HD videos is around 16 hours.

Why Should You Learn PySpark and AWS?

PySpark is the Python library that makes the magic happen.

PySpark is worth learning because of the huge demand for Spark professionals and the high salaries they command. The usage of PySpark in Big Data processing is increasing at a rapid pace compared to other Big Data tools.

AWS, launched in 2006, is the fastest-growing public cloud. The right time to cash in on cloud computing skills, AWS skills in particular, is now.

Course Content:

The all-inclusive course consists of the following topics:

1. Introduction:

a. Why Big Data?

b. Applications of PySpark

c. Introduction to the Instructor

d. Introduction to the Course

e. Projects Overview

2. Introduction to Hadoop, Spark EcoSystems, and Architectures (see the setup sketch after this list):

a. Hadoop EcoSystem

b. Spark EcoSystem

c. Hadoop Architecture

d. Spark Architecture

e. PySpark Databricks setup

f. PySpark local setup
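
Before diving into RDDs, the local setup is worth sketching. Assuming Python and Java are installed, running "pip install pyspark" pulls in Spark itself, and a session can then be created as follows:

    # pip install pyspark
    from pyspark.sql import SparkSession

    # Run Spark locally, using all available CPU cores.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("local-setup-check")
             .getOrCreate())

    print(spark.version)     # verify the installation
    sc = spark.sparkContext  # the SparkContext used for the RDD work that follows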

3. Spark RDDs (see the code sketch after this list):

a. Introduction to PySpark RDDs

b. Understanding underlying Partitions

c. RDD transformations

d. RDD actions

e. Creating Spark RDD

f. Running Spark Code Locally

g. RDD Map (Lambda)

h. RDD Map (Simple Function)

i. RDD FlatMap

j. RDD Filter

k. RDD Distinct

l. RDD GroupByKey

m. RDD ReduceByKey

n. RDD (Count and CountByValue)

o. RDD (saveAsTextFile)

p. RDD (Partition)

q. Finding Average

r. Finding Min and Max

s. Mini project on student data set analysis

t. Total Marks by Male and Female Student

u. Total Passed and Failed Students

v. Total Enrollments per Course

w. Total Marks per Course

x. Average marks per Course

y. Finding Minimum and Maximum marks

z. Average Age of Male and Female Students
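
To give a flavor of the RDD topics listed above, here is a minimal sketch combining several of the transformations and actions covered; the student records are made up for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
    sc = spark.sparkContext

    # Hypothetical (name, gender, marks) records, similar in shape to the
    # student data set analyzed in the mini project.
    rows = [("Ali", "M", 70), ("Sara", "F", 85), ("John", "M", 55), ("Mina", "F", 92)]
    rdd = sc.parallelize(rows)

    # Transformation: map each record to a (gender, marks) pair.
    pairs = rdd.map(lambda r: (r[1], r[2]))

    # Transformation: total marks per gender via reduceByKey.
    totals = pairs.reduceByKey(lambda a, b: a + b)

    # Action: collect the results to the driver.
    print(totals.collect())  # e.g. [('M', 125), ('F', 177)]

    # Filter + count: how many students passed (marks >= 60)?
    print(rdd.filter(lambda r: r[2] >= 60).count())  # 3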

4. Spark DFs (see the code sketch after this list):

a. Introduction to PySpark DFs

b. Understanding underlying RDDs

c. DFs transformations

d. DFs actions

e. Creating Spark DFs

f. Spark Infer Schema

g. Spark Provide Schema

h. Create DF from RDD

i. Select DF Columns

j. Spark DF with Column

k. Spark DF with Column Renamed and Alias

l. Spark DF Filter rows

m. Spark DF (Count, Distinct, Duplicate)

n. Spark DF (sort, orderBy)

o. Spark DF (Group By)

p. Spark DF (UDFs)

q. Spark DF (DF to RDD)

r. Spark DF (Spark SQL)

s. Spark DF (Write DF)

t. Mini project on Employees' data set analysis

u. Project Overview

v. Project (Count and Select)

w. Project (Group By)

x. Project (Group By, Aggregations, and Order By)

y. Project (Filtering)

z. Project (UDF and With Column)

aa. Project (Write)
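
As a taste of the DataFrame topics above, here is a compact sketch touching several of them; the employee records are made up for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").appName("df-demo").getOrCreate()

    # Hypothetical employee records, similar in spirit to the mini project.
    df = spark.createDataFrame(
        [("Alice", "IT", 5000), ("Bob", "IT", 4000), ("Cara", "HR", 4500)],
        ["name", "dept", "salary"],
    )

    # withColumn, filter, groupBy/agg, and orderBy in one chain.
    result = (df.withColumn("bonus", F.col("salary") * 0.1)
                .filter(F.col("salary") > 4000)
                .groupBy("dept")
                .agg(F.avg("salary").alias("avg_salary"))
                .orderBy(F.desc("avg_salary")))
    result.show()

    # The same data queried through Spark SQL.
    df.createOrReplaceTempView("employees")
    spark.sql("SELECT dept, COUNT(*) AS n FROM employees GROUP BY dept").show()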

5. Collaborative filtering (see the code sketch after this list):

a. Understanding collaborative filtering

b. Developing a recommendation system using the ALS model

c. Utility Matrix

d. Explicit and Implicit Ratings

e. Expected Results

f. Dataset

g. Joining DataFrames

h. Train and Test Data

i. ALS model

j. Hyperparameter tuning and cross-validation

k. Best model and evaluate predictions

l. Recommendations
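
To make the ALS workflow above concrete, here is a hedged sketch; the ratings file and its column names are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.evaluation import RegressionEvaluator

    spark = SparkSession.builder.appName("als-demo").getOrCreate()

    # Hypothetical explicit-ratings data with userId, movieId, rating columns.
    ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
    train, test = ratings.randomSplit([0.8, 0.2], seed=42)

    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
              coldStartStrategy="drop")  # drop predictions for unseen users/items
    model = als.fit(train)

    # Evaluate the predictions with RMSE.
    preds = model.transform(test)
    rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                               predictionCol="prediction").evaluate(preds)
    print(f"RMSE = {rmse:.3f}")

    # Top-5 recommendations for every user.
    model.recommendForAllUsers(5).show(truncate=False)

In practice, hyperparameters such as rank and regParam would be tuned with a CrossValidator, as the hyperparameter tuning and cross-validation lecture covers.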

6. Spark Streaming (see the code sketch after this list):

a. Understanding the difference between batch and streaming analysis

b. Hands-on with Spark Streaming through a word count example

c. Spark Streaming with RDD

d. Spark Streaming Context

e. Spark Streaming Reading Data

f. Spark Streaming Cluster Restart

g. Spark Streaming RDD Transformations

h. Spark Streaming DF

i. Spark Streaming Display

j. Spark Streaming DF Aggregations
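
The classic word count gives the flavor of the streaming topics above. This sketch uses Structured Streaming (the lectures also cover the older StreamingContext/DStream API) and assumes a socket source, e.g. one started with "nc -lk 9999":

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

    # Read a stream of text lines from a local socket.
    lines = (spark.readStream.format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

    # Split each line into words and keep a running count per word.
    words = lines.select(F.explode(F.split(F.col("value"), " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Continuously print the running counts to the console.
    query = (counts.writeStream
                   .outputMode("complete")
                   .format("console")
                   .start())
    query.awaitTermination()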

7. ETL Pipeline (see the code sketch after this list):

a. Understanding ETL

b. ETL pipeline Flow

c. Data set

d. Extracting Data

e. Transforming Data

f. Loading Data (Creating RDS)

g. RDS Networking

h. Downloading Postgres

i. Installing Postgres

j. Connecting to RDS through PgAdmin

k. Loading Data
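
A condensed, hypothetical sketch of such a pipeline: extract a raw file, apply a simple transformation, and load the result into a Postgres database on RDS over JDBC. The endpoint, credentials, and table name are placeholders, and the Postgres JDBC driver must be available to Spark:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("etl-demo")
             # pulls in the Postgres JDBC driver; the version is illustrative
             .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
             .getOrCreate())

    # Extract: read the raw data set (hypothetical file).
    raw = spark.read.csv("raw_data.csv", header=True, inferSchema=True)

    # Transform: a simple cleaning step, dropping rows with missing values.
    clean = raw.dropna()

    # Load: write to a (hypothetical) RDS Postgres instance over JDBC.
    (clean.write.format("jdbc")
          .option("url", "jdbc:postgresql://<rds-endpoint>:5432/mydb")
          .option("dbtable", "public.customers")
          .option("user", "postgres")
          .option("password", "<password>")
          .mode("append")
          .save())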

8. Project – Change Data Capture / Ongoing Replication (see the code sketch after this list):

a. Introduction to Project

b. Project Architecture

c. Creating an RDS MySQL Instance

d. Creating an S3 Bucket

e. Creating DMS Source Endpoint

f. Creating DMS Destination Endpoint

g. Creating a DMS Instance

h. MySQL Workbench

i. Connecting with RDS and Dumping Data

j. Querying RDS

k. DMS Full Load

l. DMS Replication Ongoing

m. Stopping Instances

n. Glue Job (Full Load)

o. Glue Job (Change Capture)

p. Glue Job (CDC)

q. Creating Lambda Function and Adding Trigger

r. Checking Trigger

s. Getting S3 file name in Lambda

t. Creating Glue Job

u. Adding Invoke for Glue Job

v. Testing Invoke

w. Writing Glue Shell Job

x. Full Load Pipeline

y. Change Data Capture Pipeline
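
To hint at how the Lambda-to-Glue hand-off in lectures q through v fits together, here is a hedged sketch of a Lambda handler that reads the uploaded file's location from the S3 trigger event and starts a Glue job; the job name and argument keys are hypothetical:

    import boto3

    glue = boto3.client("glue")

    def lambda_handler(event, context):
        # An S3 trigger event carries the bucket name and object key.
        record = event["Records"][0]["s3"]
        bucket = record["bucket"]["name"]
        key = record["object"]["key"]

        # Start the (hypothetical) Glue job, passing the new file as job arguments.
        response = glue.start_job_run(
            JobName="cdc-pipeline-job",
            Arguments={"--s3_bucket": bucket, "--s3_key": key},
        )
        return {"JobRunId": response["JobRunId"]}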

After the successful completion of this course, you will be able to:

  • Relate the concepts and practical applications of Spark and AWS to real-world problems.
  • Implement any project that requires PySpark knowledge from scratch.
  • Know the theory and practical aspects of PySpark and AWS.

Who this course is for:

  • People who are beginners and know absolutely nothing about PySpark and AWS.
  • People who want to develop intelligent solutions.
  • People who want to learn PySpark and AWS.
  • People who love to learn theoretical concepts first before implementing them using Python.
  • People who want to learn PySpark along with its implementation in realistic projects.
  • Big Data Scientists.
  • Big Data Engineers.


Goals

What you will learn in this course:

  • The introduction and importance of Big Data

  • Practical explanation and live coding with PySpark

  • Spark applications

  • Spark EcoSystem

  • Spark Architecture

  • Hadoop EcoSystem

  • Hadoop Architecture

  • PySpark RDDs

  • PySpark RDD transformations

  • PySpark RDD actions

  • PySpark DataFrames

  • PySpark DataFrames transformations

  • PySpark DataFrames actions

  • Collaborative filtering in PySpark

  • Spark Streaming

  • ETL Pipeline

  • CDC and ongoing replication

Prerequisites

What are the prerequisites for this course?

  • Prior knowledge of Python.

  • An elementary understanding of programming.

  • A willingness to learn and practice.


Curriculum

Check out the detailed breakdown of what’s inside the course

Introduction
6 Lectures
  • Introduction 01:44
  • Why Big Data 03:23
  • Applications of PySpark 03:12
  • Introduction to Instructor 00:46
  • Introduction to Course 01:49
  • Projects Overview 03:25
Introduction to Hadoop, Spark EcoSystems and Architectures
17 Lectures

Spark RDDs
37 Lectures

Spark DFs
41 Lectures

Collaborative filtering
12 Lectures

Spark Streaming
10 Lectures

ETL Pipeline
13 Lectures

Project - Change Data Capture / Ongoing Replication
27 Lectures

Chatbots Development with Amazon Lex
33 Lectures

Instructor Details

AISciences

We are a group of experts, PhDs, and Practitioners of Artificial Intelligence, Computer Science, Machine Learning, and Statistics. Some of us work in big companies like Amazon, Google, Facebook, Microsoft, KPMG, BCG, and IBM.

We decided to produce a series of courses mainly dedicated to beginners and newcomers on the techniques and methods of Machine Learning, Statistics, Artificial Intelligence, and Data Science. 

Initially, our objective was to help only those who wished to understand these techniques more easily and to be able to start without too much theory or lengthy reading. Today, we also publish more complete courses on some topics for a wider audience.

Our courses have had phenomenal success, helping more than 100,000 students master AI and Data Science.


✅ Stay connected with us:

👉 Twitter: https://twitter.com/AISciencesLearn 

👉 Facebook: https://www.facebook.com/AISciencesLearn   

👉 LinkedIn: https://www.linkedin.com/company/ai-sciences/

👉 Website: http://www.aisciences.io   


✅ For Business Inquiries: contact@aisciences.io

Course Certificate

Use your certificate to make a career change or to advance in your current career.


