Introduction to Vaex in Python

In the realm of data science, handling large datasets presents significant challenges in memory management and execution speed. Vaex is a Python library designed specifically to address these problems by providing lazy evaluation and out-of-core processing capabilities.

Vaex is built for analyzing, manipulating, and visualizing big data on modern hardware such as multi-core CPUs and SSDs. Unlike libraries that load an entire dataset into RAM, such as Pandas, Vaex memory-maps data on disk and uses lazy evaluation, so calculations run only when a result is actually requested.
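
To make the memory-mapping idea concrete, here is a minimal sketch that uses only Python's standard `mmap` module. It is not how Vaex is implemented; the file name, the `read_value` helper, and the float64 layout are assumptions chosen for illustration. The point is that a single value can be read from a file on disk without loading the rest of the file.

```python
import mmap
import os
import struct
import tempfile

# Write 1,000 float64 values (0.0, 1.0, 2.0, ...) to a temporary binary file.
N_VALUES = 1_000
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"<{N_VALUES}d", *(float(i) for i in range(N_VALUES))))


def read_value(file_path, index):
    """Read one float64 at `index` through a memory map (hypothetical helper)."""
    with open(file_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Only the part of the file touched here has to be paged in by the OS.
            (value,) = struct.unpack_from("<d", mapped, index * 8)
            return value


value_at_500 = read_value(path, 500)
print(value_at_500)  # 500.0
```

The same principle, applied to columnar files, is what allows a dataset larger than RAM to be opened almost instantly.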

Why Use Vaex?

Vaex overcomes limitations found in libraries like Pandas through several key features:

Lazy Evaluation: Computations are performed only when needed
Memory Mapping: No data copying, minimal RAM usage
Virtual Columns: Store expressions without consuming memory
Fast Visualization: Interactive plots with large datasets

Installation

Install Vaex using pip or conda:

# Using pip
pip install --upgrade vaex

# Using conda
conda install -c conda-forge vaex
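
After installing, it can be useful to confirm that the package is visible to the interpreter you are actually running. The following is a small sketch using the standard library; `is_installed` is a hypothetical helper name, not part of Vaex.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


print("vaex installed:", is_installed("vaex"))
```

If this prints `False`, the install most likely went into a different environment than the one running the script.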

import vaex

# Load example dataset
df_v = vaex.example()
print(f"Dataset shape: {df_v.shape}")
print(df_v.head())

Dataset shape: (330000, 11)
     #    x        y        z       vx        vy        vz       E         L         Lz        FeH      id
     0   -0.777   -0.55    -1.04   -41.5      6.12     -9.84    -121     491       -41       -1.78      0
     1    1.54     0.31     0.57   120       -108      -22.8    -119     -21.5     -12.4     -1.64      1
     2    0.446    2.24    -1.12   -90.2      24.2      27       -128     238        56.6     -1.31      2
     3    1.34    -1.69    -0.55    28        12.2     -26.1     -112     -9.36     -34.9     -2.04      3
     4    5.14     0.77     1.32   -2.55     -74.3     -13.1     -118     1180       35.8     -1.87      4

Performance Comparison with Pandas

Data Loading Performance

import pandas as pd
import vaex
import time

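# Vaex: time opening the example dataset (the file is memory-mapped, not read into RAM;
# the first call may also need to download it)
start = time.perf_counter()
df_vaex = vaex.example()
vaex_time = time.perf_counter() - start

# Materialize each column as an in-memory array for the Pandas comparison
columns = df_vaex.get_column_names()
data = {col: df_vaex[col].values for col in columns}

# Pandas: time building a DataFrame from arrays that are already in memory
# (no file is read here, so the two timings measure different work)
start = time.perf_counter()
df_pandas = pd.DataFrame(data)
pandas_time = time.perf_counter() - start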

print(f"Vaex loading time: {vaex_time:.4f} seconds")
print(f"Pandas loading time: {pandas_time:.4f} seconds")
print(f"Vaex is {pandas_time/vaex_time:.1f}x faster")

Vaex loading time: 0.0107 seconds
Pandas loading time: 0.0137 seconds
Vaex is 1.3x faster
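
Single-shot timings of a few milliseconds are noisy, so a difference like 1.3x can easily flip between runs. One common remedy, sketched below with the standard library only, is to repeat the measurement and keep the best time. The helper name `best_of` and the stand-in workload are assumptions for illustration.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time, in seconds, over `repeats` calls to `func`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; replace the lambda with the operation being benchmarked.
elapsed = best_of(lambda: sum(range(100_000)))
print(f"Best of 5: {elapsed:.6f} seconds")
```

Taking the minimum filters out interference from other processes, which matters most when the operation itself is very short.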

Computation Performance

# Arithmetic operations
import time

# Pandas computation
start = time.time()
pandas_result = df_pandas['x'] + df_pandas['y']
pandas_compute_time = time.time() - start

# Vaex: this line only builds a lazy expression; no sums are computed until it is evaluated
start = time.time()
vaex_result = df_vaex.x + df_vaex.y
vaex_compute_time = time.time() - start

print(f"Pandas computation: {pandas_compute_time:.4f} seconds")
print(f"Vaex computation: {vaex_compute_time:.4f} seconds")
print(f"Vaex is {pandas_compute_time/vaex_compute_time:.1f}x faster")

Pandas computation: 0.0022 seconds
Vaex computation: 0.0003 seconds
Vaex is 7.3x faster
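
The Vaex timing above covers only the construction of the expression `df_vaex.x + df_vaex.y`. Under lazy evaluation no element-wise sums are computed at that point, so the two numbers do not measure the same work. The toy class below is a conceptual sketch in plain Python, not Vaex's implementation, and `LazyColumn` is a hypothetical name. It shows why building a lazy expression is nearly free and where the real cost is paid.

```python
import operator


class LazyColumn:
    """A toy lazy column: it records how to compute values, not the values themselves."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function returning a list

    @classmethod
    def from_values(cls, values):
        return cls(lambda: values)

    def __add__(self, other):
        # Building the sum is O(1): it only wraps both operands in a new recipe.
        return LazyColumn(
            lambda: list(map(operator.add, self.evaluate(), other.evaluate()))
        )

    def evaluate(self):
        # The element-wise work happens only here.
        return self._compute()


x = LazyColumn.from_values([1.0, 2.0, 3.0])
y = LazyColumn.from_values([10.0, 20.0, 30.0])

total = x + y              # nothing computed yet
result = total.evaluate()  # computation happens now
print(result)  # [11.0, 22.0, 33.0]
```

For a like-for-like benchmark, the Vaex side would need to force evaluation inside the timed region, for example by materializing the result with `.values` as the loading example does for its columns.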

Statistical Operations

# Statistical calculations comparison
import time

# Pandas mean calculation
start = time.time()
pandas_mean = df_pandas["L"].mean()
pandas_stats_time = time.time() - start

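# Vaex mean calculation (the data is actually scanned here, so this timing is comparable)
start = time.time()
vaex_mean = df_vaex.mean(df_vaex.L)
vaex_stats_time = time.time() - start

print(f"Pandas mean: {pandas_mean:.5f} (Time: {pandas_stats_time:.4f}s)")
# float() converts the NumPy value returned for a single expression; indexing it with [0] can fail
print(f"Vaex mean: {float(vaex_mean):.5f} (Time: {vaex_stats_time:.4f}s)")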

Pandas mean: 920.81793 (Time: 0.0042s)
Vaex mean: 920.81803 (Time: 0.0025s)
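
Statistics such as the mean fit out-of-core processing well because they can be accumulated piece by piece. The sketch below illustrates the idea in plain Python under stated assumptions: `read_in_chunks` is a hypothetical stand-in for reading a large column from disk, and none of this is Vaex's actual code.

```python
def chunked_mean(chunks):
    """Compute a mean from an iterable of chunks, keeping only two running totals."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("cannot take the mean of no values")
    return total / count


def read_in_chunks(values, chunk_size):
    """Hypothetical stand-in for reading a large column from disk piece by piece."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


data = list(range(1, 11))  # 1..10
mean = chunked_mean(read_in_chunks(data, chunk_size=3))
print(mean)  # 5.5
```

Memory use depends on the chunk size rather than on the size of the dataset, which is the property that matters for data larger than RAM.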

Data Filtering

Vaex filters without copying the underlying data, which keeps the operation cheap in both time and memory:

import time

# Pandas filtering
start = time.time()
df_pandas_filtered = df_pandas[df_pandas['x'] > 0]
pandas_filter_time = time.time() - start

# Vaex filtering (no memory copy)
start = time.time()
df_vaex_filtered = df_vaex[df_vaex['x'] > 0]
vaex_filter_time = time.time() - start

print(f"Pandas filtering: {pandas_filter_time:.4f} seconds")
print(f"Vaex filtering: {vaex_filter_time:.4f} seconds")
print(f"Filtered records: {len(df_vaex_filtered):,}")

Pandas filtering: 0.0197 seconds
Vaex filtering: 0.0013 seconds
Filtered records: 164,799
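
A filter that does not copy data can be thought of as a reference to the original rows plus a condition that is applied when the rows are consumed. The following plain-Python sketch illustrates that idea; `FilteredView` is a hypothetical class and is not how Vaex represents filters internally.

```python
class FilteredView:
    """A toy filtered view: a reference to the data plus a predicate, with no copy."""

    def __init__(self, values, predicate):
        self._values = values        # shared reference, not a copy
        self._predicate = predicate

    def __iter__(self):
        # Rows are tested lazily, one at a time, when the view is consumed.
        return (v for v in self._values if self._predicate(v))

    def __len__(self):
        # Counting matches requires a pass over the data.
        return sum(1 for _ in self)


x_values = [-2.0, 0.5, -0.1, 3.0, 1.5]
positive = FilteredView(x_values, lambda v: v > 0)  # O(1): nothing scanned yet

print(len(positive))   # 3
print(list(positive))  # [0.5, 3.0, 1.5]
```

Creating the view is constant-time, while counting or listing the matching rows still costs a scan. This is worth keeping in mind when reading filter timings that measure only the creation step.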

Virtual Columns

Virtual columns store expressions without consuming additional memory:

# Create virtual column
df_vaex['x_squared'] = df_vaex['x']**2

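# Virtual column behaves like a regular column
# float() converts the NumPy value returned for a single expression; indexing it with [0] can fail
print(f"Mean of x_squared: {float(df_vaex.mean(df_vaex.x_squared)):.5f}")
print("Virtual columns don't consume extra memory")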
print(f"Column names: {df_vaex.get_column_names()[-3:]}")  # Show last 3 columns

Mean of x_squared: 52.94399
Virtual columns don't consume extra memory
Column names: ['FeH', 'id', 'x_squared']
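
One way to picture a virtual column is as a named formula stored next to the real columns and evaluated on access. The sketch below is a plain-Python illustration only; `TinyFrame` and `add_virtual` are hypothetical names, not Vaex APIs.

```python
class TinyFrame:
    """A toy table with real columns (stored lists) and virtual columns (stored formulas)."""

    def __init__(self, **columns):
        self.columns = dict(columns)  # name -> list of values
        self.virtual = {}             # name -> function of the frame

    def add_virtual(self, name, formula):
        # Only the formula is stored; no per-row values are allocated.
        self.virtual[name] = formula

    def __getitem__(self, name):
        if name in self.virtual:
            return self.virtual[name](self)  # computed on access
        return self.columns[name]

    def column_names(self):
        return list(self.columns) + list(self.virtual)


frame = TinyFrame(x=[1.0, -2.0, 3.0])
frame.add_virtual("x_squared", lambda f: [v ** 2 for v in f["x"]])

print(frame.column_names())  # ['x', 'x_squared']
print(frame["x_squared"])    # [1.0, 4.0, 9.0]
```

The trade-off is that the formula is re-evaluated each time the column is used, so storage is saved at the price of repeated computation.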

Multiple Selections

# Create multiple selections
df_vaex.select(df_vaex.id < 15, name='small_ids')
df_vaex.select(df_vaex.id >= 15, name='large_ids')

# Compute statistics for both selections in one pass
means = df_vaex.mean(df_vaex.id, selection=['small_ids', 'large_ids'])
print(f"Mean of small IDs: {means[0]:.2f}")
print(f"Mean of large IDs: {means[1]:.2f}")

Mean of small IDs: 7.01
Mean of large IDs: 23.50
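
Evaluating several selections together pays off because the data only has to be scanned once. The plain-Python sketch below shows the single-pass idea; `means_by_selection` and the id range 0..32 are assumptions for illustration and do not reproduce Vaex's implementation or the example dataset.

```python
def means_by_selection(values, selections):
    """Compute the mean of `values` under several named predicates in a single pass."""
    totals = {name: 0.0 for name in selections}
    counts = {name: 0 for name in selections}
    for value in values:  # the data is scanned once
        for name, predicate in selections.items():
            if predicate(value):
                totals[name] += value
                counts[name] += 1
    # Selections that match nothing map to None rather than dividing by zero.
    return {
        name: (totals[name] / counts[name] if counts[name] else None)
        for name in selections
    }


ids = list(range(33))  # hypothetical ids 0..32
means = means_by_selection(
    ids,
    {"small_ids": lambda i: i < 15, "large_ids": lambda i: i >= 15},
)
print(means)  # {'small_ids': 7.0, 'large_ids': 23.5}
```

With data on disk, one pass instead of two roughly halves the I/O, which is usually the dominant cost for large files.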

Performance Summary

Operation       Pandas     Vaex       Improvement
Data Loading    0.0137s    0.0107s    1.3x faster
Arithmetic      0.0022s    0.0003s    7.3x faster
Statistics      0.0042s    0.0025s    1.7x faster
Filtering       0.0197s    0.0013s    15x faster

Conclusion

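Vaex handles large datasets through lazy evaluation, memory mapping, and virtual columns. In the timings above it was about 15x faster than Pandas on filtering and about 1.7x faster on the mean calculation. The arithmetic and filtering figures partly reflect lazy evaluation, which defers the actual work until a result is requested, so they should be read as the cost of setting up an operation rather than of completing it.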

Harischandra Prasad

Updated on: 2026-03-27T15:25:03+05:30
