Introduction to Vaex in Python
In the realm of data science, handling large datasets presents significant challenges in memory management and execution speed. Vaex is a Python library designed specifically to address these problems by providing lazy evaluation and out-of-core processing capabilities.
Vaex excels at analyzing, manipulating, and visualizing big data by leveraging modern hardware like multi-core CPUs and SSDs. Unlike libraries that load an entire dataset into RAM, Vaex memory-maps files on disk and evaluates expressions lazily, so calculations run only when a result is actually requested.
Why Use Vaex?
Vaex overcomes limitations found in libraries like Pandas through several key features:
- Lazy Evaluation: Computations are performed only when needed
- Memory Mapping: No data copying, minimal RAM usage
- Virtual Columns: Store expressions without consuming memory
- Fast Visualization: Interactive plots with large datasets
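To make the memory-mapping idea concrete, here is a minimal sketch that uses only the Python standard library. It is not how Vaex is implemented; the file name `column.bin` and the sizes are made up for illustration. The point is that a slice of a large on-disk column can be read without loading the whole file into a Python object.

```python
from array import array
import mmap
import os
import tempfile

# Illustrative setup: write one million float64 values (0.0, 1.0, 2.0, ...) to disk.
n_values = 1_000_000
path = os.path.join(tempfile.mkdtemp(), "column.bin")  # hypothetical file name
with open(path, "wb") as f:
    array("d", range(n_values)).tofile(f)

# Map the file into memory. The operating system pages data in on demand,
# so only the region we actually touch needs to be read from disk.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # View the raw bytes as native float64 values without copying them.
        with memoryview(mm) as raw, raw.cast("d") as values:
            window_sum = sum(values[i] for i in range(500_000, 500_010))

print(window_sum)
os.remove(path)
```

Vaex applies the same principle to column files such as HDF5 or Arrow, which is why opening a large dataset can be nearly instantaneous.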
Installation
Install Vaex using pip or conda:
# Using pip
pip install --upgrade vaex
# Using conda
conda install -c conda-forge vaex
import vaex
# Load example dataset
df_v = vaex.example()
print(f"Dataset shape: {df_v.shape}")
print(df_v.head())
Dataset shape: (330000, 11)
# x y z vx vy vz E L Lz FeH id
0 -0.777 -0.55 -1.04 -41.5 6.12 -9.84 -121 491 -41 -1.78 0
1 1.54 0.31 0.57 120 -108 -22.8 -119 -21.5 -12.4 -1.64 1
2 0.446 2.24 -1.12 -90.2 24.2 27 -128 238 56.6 -1.31 2
3 1.34 -1.69 -0.55 28 12.2 -26.1 -112 -9.36 -34.9 -2.04 3
4 5.14 0.77 1.32 -2.55 -74.3 -13.1 -118 1180 35.8 -1.87 4
Performance Comparison with Pandas
Data Loading Performance
import pandas as pd
import vaex
import time
# Vaex performance
start = time.time()
df_vaex = vaex.example()
vaex_time = time.time() - start
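# Build a Pandas DataFrame from the same columns for comparison
# (the column arrays are materialized here, before the Pandas timer starts)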
columns = df_vaex.get_column_names()
data = {col: df_vaex[col].values for col in columns}
start = time.time()
df_pandas = pd.DataFrame(data)
pandas_time = time.time() - start
print(f"Vaex loading time: {vaex_time:.4f} seconds")
print(f"Pandas loading time: {pandas_time:.4f} seconds")
print(f"Vaex is {pandas_time/vaex_time:.1f}x faster")
Vaex loading time: 0.0107 seconds
Pandas loading time: 0.0137 seconds
Vaex is 1.3x faster
Computation Performance
# Arithmetic operations
import time
# Pandas computation
start = time.time()
pandas_result = df_pandas['x'] + df_pandas['y']
pandas_compute_time = time.time() - start
# Vaex computation (lazy: this only builds an expression, no values are computed yet)
start = time.time()
vaex_result = df_vaex.x + df_vaex.y
vaex_compute_time = time.time() - start
print(f"Pandas computation: {pandas_compute_time:.4f} seconds")
print(f"Vaex computation: {vaex_compute_time:.4f} seconds")
print(f"Vaex is {pandas_compute_time/vaex_compute_time:.1f}x faster")
Pandas computation: 0.0022 seconds
Vaex computation: 0.0003 seconds
Vaex is 7.3x faster
Statistical Operations
# Statistical calculations comparison
import time
# Pandas mean calculation
start = time.time()
pandas_mean = df_pandas["L"].mean()
pandas_stats_time = time.time() - start
# Vaex mean calculation (the result is a NumPy value, so convert it to a plain float)
start = time.time()
vaex_mean = float(df_vaex.mean(df_vaex.L))
vaex_stats_time = time.time() - start
print(f"Pandas mean: {pandas_mean:.5f} (Time: {pandas_stats_time:.4f}s)")
print(f"Vaex mean: {vaex_mean:.5f} (Time: {vaex_stats_time:.4f}s)")
Pandas mean: 920.81793 (Time: 0.0042s)
Vaex mean: 920.81803 (Time: 0.0025s)
Data Filtering
Vaex filters without copying memory. The filter is stored as a lazy expression and applied only when results are needed, which keeps the filtering step itself cheap:
import time
# Pandas filtering
start = time.time()
df_pandas_filtered = df_pandas[df_pandas['x'] > 0]
pandas_filter_time = time.time() - start
# Vaex filtering (no memory copy)
start = time.time()
df_vaex_filtered = df_vaex[df_vaex['x'] > 0]
vaex_filter_time = time.time() - start
print(f"Pandas filtering: {pandas_filter_time:.4f} seconds")
print(f"Vaex filtering: {vaex_filter_time:.4f} seconds")
print(f"Filtered records: {len(df_vaex_filtered):,}")
Pandas filtering: 0.0197 seconds
Vaex filtering: 0.0013 seconds
Filtered records: 164,799
Virtual Columns
Virtual columns store expressions without consuming additional memory:
# Create virtual column
df_vaex['x_squared'] = df_vaex['x']**2
# Virtual column behaves like a regular column
print(f"Mean of x_squared: {float(df_vaex.mean(df_vaex.x_squared)):.5f}")
print("Virtual columns don't consume extra memory")
print(f"Column names: {df_vaex.get_column_names()[-3:]}") # Show last 3 columns
Mean of x_squared: 52.94399
Virtual columns don't consume extra memory
Column names: ['FeH', 'id', 'x_squared']
Multiple Selections
# Create multiple selections
df_vaex.select(df_vaex.id < 15, name='small_ids')
df_vaex.select(df_vaex.id >= 15, name='large_ids')
# Compute statistics for both selections in one pass
means = df_vaex.mean(df_vaex.id, selection=['small_ids', 'large_ids'])
print(f"Mean of small IDs: {means[0]:.2f}")
print(f"Mean of large IDs: {means[1]:.2f}")
Mean of small IDs: 7.01
Mean of large IDs: 23.50
Performance Summary
| Operation | Pandas | Vaex | Improvement |
|---|---|---|---|
| Data Loading | 0.0137s | 0.0107s | 1.3x faster |
| Arithmetic | 0.0022s | 0.0003s | 7.3x faster |
| Statistics | 0.0042s | 0.0025s | 1.7x faster |
| Filtering | 0.0197s | 0.0013s | 15x faster |
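The figures in this table come from single runs timed with `time.time()`, so they can vary noticeably between machines and runs. A more robust measurement could repeat each operation several times and keep the best result. Here is a minimal sketch using the standard `timeit` module; `best_of` and `workload` are hypothetical names, and `workload` stands in for whichever Pandas or Vaex call is being measured.

```python
import timeit


def best_of(func, repeat=5, number=10):
    """Return the best average time per call, in seconds."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def workload():
    # Placeholder for the operation under test.
    return sum(i * i for i in range(10_000))


print(f"{best_of(workload):.6f} s per call")
```

Taking the minimum over several repeats reduces the influence of background activity on the measurement.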
Conclusion
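Vaex handles large datasets through lazy evaluation, memory mapping, and virtual columns. In the timings above, its largest advantages over Pandas are in filtering and arithmetic, but both operations are lazy in Vaex, so the measured time covers only building the filter or expression; the actual work happens when a result is requested. Statistical computations such as the mean are evaluated immediately and run without copying the underlying data.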
