Extracting features with PySpark



In this chapter, we will learn how to extract features with PySpark in Agile Data Science.

Overview of Spark

Apache Spark is a fast, general-purpose cluster computing framework. It keeps data in memory during computation, which lets it analyze data in near real time. Spark handles both stream processing and batch processing, and it supports interactive queries and iterative algorithms.

Spark is written in the Scala programming language.

PySpark is the Python API for Spark. It offers the PySpark shell, which links the Python API to the Spark core and initializes the Spark context (available in the shell as sc). Many data scientists use PySpark for tracking features, as discussed in the previous chapter.

In this example, we will focus on the transformations that build a dataset called counts, which holds the number of occurrences of each word in a text file, and then save it to a file.

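# sc is the SparkContext that the PySpark shell creates on startup
text_file = sc.textFile("hdfs://...")
# Split each line into words, pair each word with 1, then sum the 1s per word
counts = text_file.flatMap(lambda line: line.split(" ")) \
   .map(lambda word: (word, 1)) \
   .reduceByKey(lambda a, b: a + b)
# Write the (word, count) pairs back to storage
counts.saveAsTextFile("hdfs://...")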
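
To see what each transformation in the example above produces, the same three steps can be traced in plain Python without a cluster. This is only a rough sketch of the logic, not how Spark executes it: the sample lines and the helper functions flat_map and reduce_by_key are hypothetical stand-ins written for illustration, and they run on ordinary Python lists rather than distributed RDDs.

```python
# Hypothetical sample input standing in for the lines of the text file.
lines = ["spark makes big data simple", "big data needs spark"]


def flat_map(func, items):
    """Apply func to each item and flatten the results (mimics RDD.flatMap)."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Merge values that share a key using func (mimics RDD.reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Step 1: split every line into words and flatten into one list.
words = flat_map(lambda line: line.split(" "), lines)

# Step 2: pair every word with the count 1.
pairs = [(word, 1) for word in words]

# Step 3: add up the 1s for each distinct word.
counts = reduce_by_key(lambda a, b: a + b, pairs)

print(counts)
```

In Spark, the same steps are evaluated lazily and spread across the cluster; nothing is computed until an action such as saveAsTextFile is called.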

Using PySpark, a user can work with RDDs (Resilient Distributed Datasets) in the Python programming language. The Py4J library, which ships with PySpark, makes this possible by letting Python code drive the JVM-based Spark core.
