
Data Architecture - Data Lakehouse
This chapter focuses on the concept of a data lakehouse, a new approach to managing and analyzing data. We'll explain what a data lakehouse is, how it combines the best features of data lakes and data warehouses, and why it's becoming a popular choice for storing and processing large amounts of data.
What is a Data Lakehouse?
A data lakehouse is a way to store and manage data. It combines the benefits of both a data lake and a data warehouse in one system, making data storage simpler. Instead of running both systems, you can use a single data lake with improved capabilities.
Why Do We Need Data Lakehouses?
Data lakehouses solve issues from older systems where raw and processed data were kept separate, causing delays, high costs, and poor collaboration.
The Old Way
Before data lakehouses, companies faced these problems:
- They had to maintain separate systems for raw data and processed data.
- Moving data between systems was slow and expensive.
- Data teams couldn't work together easily.
- Data quality was hard to maintain.
The New Way
Data lakehouses solve these problems by:
- Keeping all data in one place.
- Making it easier to analyze data.
- Saving money on storage.
- Helping teams work together better.
Key Functions of a Data Lakehouse
When using a data lakehouse, you can:
- Store any kind of data (numbers, text, pictures, videos).
- Keep track of changes to your data over time.
- Let different people work with the data at the same time.
- Make sure your data stays accurate and reliable.
Benefits of a Data Lakehouse
A data lakehouse offers several advantages, including:
- It stores all data in one place.
- It reduces costs by using a single system for both storage and processing.
- It supports both real-time and batch data.
- It keeps data clean and reliable with built-in checks.
- It makes collaboration easier and speeds up analysis.
How Does a Data Lakehouse Work?
A Data Lakehouse works by combining a few steps to make data easier to store, manage, and analyze. Here's how it works:
- Data Ingestion: The process starts by collecting data from different sources, like apps, sensors, and databases. This data is then stored in a system that can handle all kinds of data, whether structured, semi-structured, or unstructured.
- Data Processing: Once the data is ingested, it's cleaned and transformed to make it ready for analysis. This step organizes the raw data into a more usable format.
- Data Management: A metadata layer is used to track and manage the data. This helps keep everything organized and ensures that users can easily find and access the right data when they need it.
- Data Analysis: Finally, users can run queries, generate reports, and extract insights from the data to inform decision-making.
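To make these stages concrete, here is a minimal PySpark sketch of the flow. It assumes a Spark session configured with the Delta Lake extensions (the delta-spark package); all paths and column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

# Assumes delta-spark is installed and the session is Delta-enabled.
spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()

# 1. Ingestion: land raw JSON events in the lake as-is.
raw = spark.read.json("/lake/raw/orders/")

# 2. Processing: clean and reshape the raw data.
cleaned = (raw
    .dropDuplicates(["order_id"])
    .withColumn("order_date", to_date(col("order_ts")))
    .filter(col("amount") > 0))

# 3. Management: write a Delta table; its transaction log tracks the
#    schema and every subsequent change.
cleaned.write.format("delta").mode("overwrite").save("/lake/curated/orders")

# 4. Analysis: query the curated table with SQL.
spark.read.format("delta").load("/lake/curated/orders").createOrReplaceTempView("orders")
spark.sql("SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date").show()
```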
Delta Lake for Lakehouse
Delta Lake adds reliability, security, and performance features to a data lake. It isn't a storage system itself; it works on top of an existing data lake. You can convert your data lake into a Delta Lake simply by saving data in the Delta Lake format instead of formats like CSV or JSON.
When you use the Delta Lake format, your data is stored as Parquet files accompanied by a transaction log that tracks all changes. This improves the data lake's functionality, and because of its popularity, most data tools support the format.
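As a sketch of such a conversion (paths are hypothetical, and the Spark session is assumed to be configured as in the earlier example):

```python
from delta.tables import DeltaTable

# Rewrite a CSV dataset in Delta Lake format.
df = spark.read.option("header", True).csv("/lake/raw/customers.csv")
df.write.format("delta").mode("overwrite").save("/lake/delta/customers")

# An existing Parquet directory can also be converted in place, which
# generates a transaction log for the files already there.
DeltaTable.convertToDelta(spark, "parquet.`/lake/parquet/customers`")
```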
Other options for adding functionality to data lakes include Apache Iceberg and Apache Hudi.
Delta Lake Features
Delta Lake adds several important features to data lakes, making them more like relational data warehouses. Here are some important features.
DML Support
Delta Lake supports DML commands like INSERT, DELETE, UPDATE, and MERGE, making data management easier. Unlike traditional data lakes, which only handle batch processing and don't allow in-place updates, Delta Lake lets you update data efficiently without rewriting entire files.
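The sketch below shows these DML commands through the Delta Lake Python API; the table path, column names, and staging data are illustrative.

```python
from delta.tables import DeltaTable

orders = DeltaTable.forPath(spark, "/lake/curated/orders")

# UPDATE: fix a mis-keyed currency code; only the affected files are rewritten.
orders.update(condition="currency = 'UK'", set={"currency": "'GBP'"})

# DELETE rows matching a predicate.
orders.delete("amount <= 0")

# MERGE (upsert): apply a batch of changes from a staging table.
updates = spark.read.format("delta").load("/lake/staging/order_updates")
(orders.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```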
Delta Tables and Transaction Logs
In Delta Lake, data is organized into Delta tables, which store a large table as many smaller files for easier management. A transaction log keeps track of changes, which speeds up DML operations by optimizing storage access and using in-memory processing. For example, when you run an UPDATE statement, Delta Lake reads and rewrites only the affected files instead of the entire table.
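You can inspect the transaction log directly; a small sketch, continuing the earlier example:

```python
from delta.tables import DeltaTable

orders = DeltaTable.forPath(spark, "/lake/curated/orders")

# history() exposes the transaction log as a DataFrame: one row per commit.
(orders.history()
    .select("version", "timestamp", "operation", "operationMetrics")
    .show(truncate=False))

# On disk, the log sits next to the Parquet data files, e.g.:
#   /lake/curated/orders/_delta_log/00000000000000000000.json
#   /lake/curated/orders/part-00000-....snappy.parquet
```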
ACID Transactions
Delta Lake supports ACID (Atomicity, Consistency, Isolation, Durability) properties for transactions, but only within a single Delta Table. Unlike relational databases that can handle transactions across multiple tables, Delta Lake's ACID support is limited to one table at a time.
Time Travel
Delta Lake includes a "time travel" feature that allows you to query data as it was at a specific point in time. The transaction log tracks all changes, so you can easily access previous versions of your data or revert changes if necessary. This feature is particularly useful for auditing and data recovery.
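A brief sketch of time travel, with an illustrative version number and timestamp:

```python
# Query the table as it was at version 3...
v3 = spark.read.format("delta").option("versionAsOf", 3).load("/lake/curated/orders")

# ...or as it was at a point in time.
snapshot = (spark.read.format("delta")
    .option("timestampAsOf", "2024-01-15 00:00:00")
    .load("/lake/curated/orders"))

# Roll the live table back to an earlier version if a bad write slipped through.
spark.sql("RESTORE TABLE delta.`/lake/curated/orders` TO VERSION AS OF 3")
```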
Small Files Problem
Delta Lake addresses the "small files" problem, where having too many small files can hurt performance and raise storage costs. It automatically uses compaction algorithms to merge small files into larger ones, which improves efficiency and reduces storage overhead.
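In open-source Delta Lake, compaction can also be triggered explicitly with the OPTIMIZE command (available in recent versions); a sketch with an illustrative path:

```python
# Merge many small files into fewer large ones.
spark.sql("OPTIMIZE delta.`/lake/curated/orders`")

# VACUUM then removes old files no longer referenced by the transaction
# log (the default retention window is 7 days, i.e. 168 hours).
spark.sql("VACUUM delta.`/lake/curated/orders` RETAIN 168 HOURS")
```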
Unified Processing
Delta Lake allows users to handle both batch and real-time streaming on the same data. This makes the data processing workflow and architecture easier, removing the need for separate systems for batch and streaming tasks.
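For example, the same Delta table can be read in batch and consumed as a stream, assuming the session and paths from the earlier sketches:

```python
# Batch read of the table.
batch_df = spark.read.format("delta").load("/lake/curated/orders")

# Streaming read of the same table: new commits are picked up incrementally.
stream_df = spark.readStream.format("delta").load("/lake/curated/orders")

# Streaming write into another Delta table.
query = (stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/lake/checkpoints/orders_mirror")
    .start("/lake/curated/orders_mirror"))
```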
Schema Enforcement
Delta Lake enforces a schema on write, ensuring that data written to a Delta table matches the table's expected columns and data types and satisfies constraints such as NOT NULL. Invalid data is rejected during write operations, which helps prevent data corruption.
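A small sketch of enforcement in action, assuming the illustrative orders table above, whose amount column is numeric:

```python
# This DataFrame has amount as a string, which doesn't match the table schema.
bad_df = spark.createDataFrame([("A-1", "not-a-number")], ["order_id", "amount"])

try:
    bad_df.write.format("delta").mode("append").save("/lake/curated/orders")
except Exception as err:
    print("Write rejected by schema enforcement:", err)

# Intentional schema changes must be opted into explicitly, e.g.:
# good_df.write.format("delta").mode("append") \
#     .option("mergeSchema", "true").save("/lake/curated/orders")
```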
Performance Improvements with Delta Lake
Delta Lake improves the performance of data lakes in many ways:
- Data Skipping: Delta Lake can skip irrelevant data when reading from a Delta Table, so queries only focus on the necessary data, which speeds up performance.
- Caching: By supporting data caching in Spark, Delta Lake makes repeated queries faster, reducing the time it takes to run them after the first execution.
- Fast Indexing: Delta Lake uses an optimized indexing structure to quickly find the data you need, speeding up query execution.
- Query Optimization: Delta Lake works with Spark SQL to make queries faster and more efficient by using Spark's built-in optimization.
- Predicate Pushdown: Filters are applied directly at the storage layer, meaning less data needs to be processed, which speeds up query execution.
- Column Pruning: Only the needed columns are read, reducing data processing and speeding up queries.
- Vectorized Execution: Delta Lake processes multiple data points with a single CPU instruction, improving CPU performance and overall speed.
- Parallel Processing: Delta Lake supports running tasks in parallel, allowing multiple operations to be processed at the same time for faster results.
- Z-order: Delta Lake uses Z-order indexing to co-locate related data for faster, more efficient access, as shown in the sketch after this list.
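Data skipping and Z-ordering work together: Delta Lake keeps per-file min/max statistics, and Z-ordering co-locates related values so those statistics prune more files. A sketch, with illustrative column names:

```python
# Reorganize the table so rows with similar keys land in the same files.
spark.sql("OPTIMIZE delta.`/lake/curated/orders` ZORDER BY (customer_id, order_date)")

# A query filtering on the Z-ordered columns now reads far fewer files.
spark.sql("""
    SELECT * FROM delta.`/lake/curated/orders`
    WHERE customer_id = 'C-42' AND order_date = '2024-01-15'
""").show()
```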
Data Lakehouse Architecture
The data lakehouse architecture makes data management easier by combining the features of both data lakes and data warehouses.
In a data lakehouse, data moves through the same stages as in other systems: ingestion, storage, transformation, modeling, and visualization. However, instead of using separate data lakes and relational data warehouses, everything is stored in a single data lake that uses Delta Lake technology.
This approach solves many common problems in traditional data systems.
- Reliability: Keeping data consistent between a data lake and a relational data warehouse can be difficult; data transfers can fail or produce mismatched data. With a data lakehouse, there's no need to copy data between systems, which eliminates these issues.
- Data Staleness: Data in a relational data warehouse can become stale because it's only refreshed at set intervals, leading to inconsistent reports. A data lakehouse keeps all data in one place, ensuring it's always up to date.
- Advanced Analytics Support: Relational data warehouses aren't well-suited for advanced analytics, like AI and machine learning, because these tools work better with the raw data found in data lakes. A data lakehouse makes it easier for data scientists to work directly with the data they need.
- Cost Efficiency: Managing both a relational data warehouse and a data lake is expensive. A data lakehouse stores everything in one place, cutting storage and compute costs.
- Data Governance: With data stored in separate systems, it's harder to manage access and maintain quality. A data lakehouse uses a single copy of data, making governance and security simpler.
- Complexity: Managing both a data lake and a relational data warehouse requires specialized skills. A data lakehouse reduces this complexity by consolidating everything into one platform.
What If You Skip the Relational Data Warehouse?
Skipping the relational data warehouse in favor of a data lakehouse can be a good option, especially for smaller datasets. With a data lakehouse (using Delta Lake), you only need one data storage system, which saves on storage and compute costs. You don't have to copy data to a relational data warehouse, which reduces both cost and complexity.
However, there are some important challenges to consider.
- Performance: Relational data warehouses are faster for complex queries thanks to features like indexing, caching, and query optimization. Delta Lake may not match this performance, especially with large datasets.
- Security: Relational data warehouses offer richer security features, such as row-level security, encryption, and auditing, which Delta Lake lacks.
- Concurrency: Relational data warehouses can serve more users and workloads at the same time. Delta Lake may struggle with a high number of concurrent users.
- Metadata Management: Relational data warehouses manage metadata more easily because it's built into the system. Delta Lake's file-based approach makes metadata management more cumbersome.
- Learning Curve: People used to relational data warehouses may find Delta Lake harder to use and may need extra training.
When to Use Delta Lake
Use Delta Lake when:
- Queries are not time-sensitive: If you don't need real-time results, Delta Lake offers good performance at a lower cost.
- Advanced features aren't necessary: If you don't need things like complex query optimization, fast joins, or special indexing, Delta Lake is a simpler and more affordable choice.
- Smaller datasets: For smaller data volumes, Delta Lake works well and avoids the complexity of a full relational data warehouse setup.
- Cost is a priority: If keeping costs low is important, Delta Lake helps you save on storage and compute, especially with serverless options.
In short, Delta Lake is a good choice when you prioritize simplicity, cost savings, and decent performance over high performance or advanced analytics.
Relational Serving Layer
Delta Lake doesn't have predefined metadata or relationships like a traditional relational data warehouse. Instead, it uses a schema-on-read approach, applying the schema when data is read, not when it's stored.
To make the data easier to understand, you need to create a relational serving layer. This layer links the data with its metadata and defines relationships between different pieces of data. You can build this using:
- SQL Views
- Reporting Tools
- Apache Hive Tables
- Ad-hoc SQL Queries
Once set up, you can work with the data just as you would in a relational data warehouse, without needing to know it's stored in Delta Lake.
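A minimal sketch of such a serving layer using SQL views over a Delta table (table, view, and column names are illustrative):

```python
# Register the Delta path as a named table in the catalog.
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders
    USING DELTA LOCATION '/lake/curated/orders'
""")

# Expose a cleaned, business-friendly view on top of it.
spark.sql("""
    CREATE OR REPLACE VIEW vw_daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""")

# Reporting tools and analysts query the view like a warehouse table.
spark.sql("SELECT * FROM vw_daily_revenue ORDER BY order_date").show()
```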
However, there are some challenges:
- Metadata might not always match the data perfectly.
- Different layers may point to the same data but have inconsistent metadata, which can lead to errors or confusion.
Use Cases of Data Lakehouse
A data lakehouse is useful in many scenarios, including:
- Business Analytics: Companies use data lakehouses to track sales trends, understand customer preferences, and make smarter decisions about what products to focus on or restock.
- Scientific Research: Researchers store and share their data in a data lakehouse, helping them collaborate with others and spot key patterns or trends in their studies.
- Healthcare Management: Hospitals use data lakehouses to organize patient records, monitor how well treatments are working, and manage hospital resources like medical supplies.