Spark SQL - Data Sources



Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on as a normal RDD, and it can also be registered as a temporary table; registering a DataFrame as a table allows you to run SQL queries over its data.

In this chapter, we will describe the general methods for loading and saving data using different Spark DataSources. Thereafter, we will discuss in detail the specific options that are available for the built-in data sources.
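As a minimal sketch of these generic methods, the following spark-shell session uses the read/write API (available from Spark 1.4 onward). The file names and column names here are placeholders for illustration, not part of this tutorial's data set.

import org.apache.spark.sql.SQLContext

// In spark-shell, sc (the SparkContext) is already defined.
val sqlContext = new SQLContext(sc)

// Generic load: the format is named explicitly; Parquet is the default if omitted.
val df = sqlContext.read.format("parquet").load("users.parquet")

// Generic save: write a projection of the data out in another format.
df.select("name", "age").write.format("json").save("namesAndAges.json")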

There are different types of data sources available in Spark SQL, some of which are listed below −

1. JSON Datasets − Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame.

2. Hive Tables − Hive comes bundled with the Spark library as HiveContext, which inherits from SQLContext.

3. Parquet Files − Parquet is a columnar format, supported by many data processing systems.

A short loading sketch for these sources is shown below.

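As a quick preview, and assuming the same spark-shell sqlContext as above together with hypothetical files employee.json and employee.parquet in the working directory, JSON and Parquet data can be loaded as follows. Hive tables additionally require a HiveContext and a Hive-enabled Spark build, so they are only indicated in comments here; each source is covered in detail in its own chapter.

// Schema inference for JSON: no schema needs to be declared up front.
val jsonDF = sqlContext.read.json("employee.json")
jsonDF.printSchema()

// Parquet files carry their own schema as part of the columnar format.
val parquetDF = sqlContext.read.parquet("employee.parquet")
parquetDF.registerTempTable("employee")
sqlContext.sql("SELECT * FROM employee").show()

// Hive tables would instead be queried through a HiveContext, for example:
// val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// hiveContext.sql("SELECT * FROM src").show()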