Find combined mean and variance of two series in Python

When working with two separate data series, you often need to find the combined mean and combined variance of the merged dataset. This is useful in statistics when combining samples from different populations or datasets.

The combined mean is the average of the individual means, weighted by series size. The combined variance accounts for both the individual variances and the squared difference between each individual mean and the combined mean.

Mathematical Formula

For two series A1 and A2 with sizes n and m respectively:

  • Combined Mean: (n × mean₁ + m × mean₂) / (n + m)

  • Combined Variance: [n × (var₁ + d₁²) + m × (var₂ + d₂²)] / (n + m)

  • Where d₁² = (mean₁ - combined_mean)² and d₂² = (mean₂ - combined_mean)²
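The variance formula follows from splitting the sum of squared deviations by series. As a brief sketch of the reasoning, write μ for the combined mean, and μ₁ and σ₁² for the mean and population variance of A1 (these symbols are shorthand introduced here, not notation from the formulas above):

```latex
\begin{aligned}
\sum_{x \in A_1} (x - \mu)^2
  &= \sum_{x \in A_1} \bigl((x - \mu_1) + (\mu_1 - \mu)\bigr)^2 \\
  &= \sum_{x \in A_1} (x - \mu_1)^2
     + 2(\mu_1 - \mu) \sum_{x \in A_1} (x - \mu_1)
     + n(\mu_1 - \mu)^2 \\
  &= n\,\sigma_1^2 + 0 + n\,d_1^2
   = n\,(\sigma_1^2 + d_1^2)
\end{aligned}
```

The middle term vanishes because deviations from a series' own mean sum to zero. The same steps for A2 give m × (var₂ + d₂²). Adding the two sums and dividing by n + m yields the combined variance.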

Implementation

def mean(arr):
    return sum(arr) / len(arr)

def variance(arr, n):
    # Population variance: mean of squared deviations (divides by n, not n - 1)
    arr_mean = mean(arr)
    total = 0
    for i in range(n):
        total += (arr[i] - arr_mean) ** 2
    return total / n

def combined_statistics(series1, series2):
    n = len(series1)
    m = len(series2)
    
    # Calculate individual means
    mean1 = mean(series1)
    mean2 = mean(series2)
    print("Mean 1:", round(mean1, 2), " Mean 2:", round(mean2, 2))
    
    # Calculate individual variances
    var1 = variance(series1, n)
    var2 = variance(series2, m)
    print("Variance 1:", round(var1, 2), " Variance 2:", round(var2, 2))
    
    # Calculate combined mean
    combined_mean = (n * mean1 + m * mean2) / (n + m)
    print("Combined Mean:", round(combined_mean, 2))
    
    # Calculate squared differences from combined mean
    d1_square = (mean1 - combined_mean) ** 2
    d2_square = (mean2 - combined_mean) ** 2
    print("d1_square:", round(d1_square, 2), " d2_square:", round(d2_square, 2))
    
    # Calculate combined variance
    combined_var = (n * (var1 + d1_square) + m * (var2 + d2_square)) / (n + m)
    print("Combined Variance:", round(combined_var, 2))

# Example usage
series1 = [24, 46, 35, 79, 13, 77, 35]
series2 = [66, 68, 35, 24, 46]

combined_statistics(series1, series2)

Output

Mean 1: 44.14  Mean 2: 47.8
Variance 1: 548.69  Variance 2: 294.56
Combined Mean: 45.67
d1_square: 2.32  d2_square: 4.55
Combined Variance: 446.06

Alternative Using NumPy

For a simpler implementation, you can use NumPy functions. By default, np.var computes the population variance (ddof=0), which is what the formula expects:

import numpy as np

def combined_stats_numpy(series1, series2):
    # Convert to numpy arrays
    arr1 = np.array(series1)
    arr2 = np.array(series2)
    
    # Individual statistics
    mean1, mean2 = np.mean(arr1), np.mean(arr2)
    var1, var2 = np.var(arr1), np.var(arr2)
    
    n, m = len(arr1), len(arr2)
    
    # Combined statistics
    combined_mean = (n * mean1 + m * mean2) / (n + m)
    d1_sq = (mean1 - combined_mean) ** 2
    d2_sq = (mean2 - combined_mean) ** 2
    combined_var = (n * (var1 + d1_sq) + m * (var2 + d2_sq)) / (n + m)
    
    print(f"Combined Mean: {combined_mean:.2f}")
    print(f"Combined Variance: {combined_var:.2f}")

series1 = [24, 46, 35, 79, 13, 77, 35]
series2 = [66, 68, 35, 24, 46]

combined_stats_numpy(series1, series2)

Output

Combined Mean: 45.67
Combined Variance: 446.06

Key Points

  • The combined mean is a weighted average based on sample sizes

  • Combined variance accounts for both individual variances and mean differences

  • The d₁² and d₂² terms represent how much each individual mean deviates from the combined mean

  • This method is mathematically equivalent to calculating the population variance (dividing by N, not N - 1) on the merged dataset
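The equivalence in the last point can be checked numerically. The following is a minimal sketch using only the standard library. The helper name combined_variance is an illustrative choice, and statistics.pvariance stands in for the hand-written variance function.

```python
import math
import statistics

def combined_variance(series1, series2):
    """Population variance of the merged data, from per-series statistics."""
    n, m = len(series1), len(series2)
    mean1, mean2 = statistics.fmean(series1), statistics.fmean(series2)
    var1 = statistics.pvariance(series1)  # divides by n
    var2 = statistics.pvariance(series2)  # divides by m
    combined_mean = (n * mean1 + m * mean2) / (n + m)
    d1_sq = (mean1 - combined_mean) ** 2
    d2_sq = (mean2 - combined_mean) ** 2
    return (n * (var1 + d1_sq) + m * (var2 + d2_sq)) / (n + m)

series1 = [24, 46, 35, 79, 13, 77, 35]
series2 = [66, 68, 35, 24, 46]
merged = series1 + series2

from_formula = combined_variance(series1, series2)

# Matches the population variance of the merged data...
print(math.isclose(from_formula, statistics.pvariance(merged)))  # True

# ...but not the sample variance, which divides by N - 1.
print(math.isclose(from_formula, statistics.variance(merged)))   # False
```

The second check is a reminder that the formula as written applies to population variances. Feeding it sample variances (divisor n - 1) would not reproduce either variance of the merged data.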

Conclusion

Combining statistics from multiple series is a common task in data analysis. The weighted formulas reproduce the mean and population variance of the merged data using only each series' size, mean, and variance. The raw data never has to be merged, which makes the approach memory-efficient for large datasets.
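To illustrate the memory-efficiency point, here is one possible sketch that keeps only a (count, mean, variance) summary per chunk and folds the summaries together pairwise. The names summarize, merge_summaries, and chunks are hypothetical. Splitting the first example series into two chunks is done only to show that the pairwise formula extends to more than two pieces.

```python
from functools import reduce

def summarize(series):
    """Return (count, mean, population variance) for one series."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    return n, mean, var

def merge_summaries(a, b):
    """Combine two (count, mean, variance) summaries into one."""
    n, mean1, var1 = a
    m, mean2, var2 = b
    total = n + m
    mean = (n * mean1 + m * mean2) / total
    d1_sq = (mean1 - mean) ** 2
    d2_sq = (mean2 - mean) ** 2
    var = (n * (var1 + d1_sq) + m * (var2 + d2_sq)) / total
    return total, mean, var

# Same 12 values as the examples above, split into three chunks
chunks = [[24, 46, 35], [79, 13, 77, 35], [66, 68, 35, 24, 46]]

count, overall_mean, overall_var = reduce(merge_summaries, map(summarize, chunks))
print(count, round(overall_mean, 2), round(overall_var, 2))  # 12 45.67 446.06
```

Because each merge needs only three numbers per side, the chunks could be processed one at a time and discarded after summarizing.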

---
Updated on: 2026-03-25T09:35:52+05:30
