Find combined mean and variance of two series in Python
When working with two separate data series, you often need to find the combined mean and combined variance of the merged dataset. This is useful in statistics when combining samples from different populations or datasets.
The combined mean is the average of the two individual means, weighted by series size. The combined variance adds, to each series' own variance, the squared distance between that series' mean and the combined mean, and then weights the two results by series size.
Mathematical Formula
For two series A1 and A2 with sizes n and m respectively:
Combined Mean: (n × mean₁ + m × mean₂) / (n + m)
Combined Variance: [n × (var₁ + d₁²) + m × (var₂ + d₂²)] / (n + m)
Where d₁² = (mean₁ - combined_mean)² and d₂² = (mean₂ - combined_mean)²
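A short derivation shows where the extra squared-difference terms come from. The notation here is an assumption introduced for illustration: x_i are the values of the first series, \mu_1 and \sigma_1^2 are its mean and population variance, \mu is the combined mean, and d_1 = \mu_1 - \mu. The middle term vanishes because deviations from a series' own mean sum to zero.

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - \mu)^2
  &= \sum_{i=1}^{n} \bigl[(x_i - \mu_1) + (\mu_1 - \mu)\bigr]^2 \\
  &= \sum_{i=1}^{n} (x_i - \mu_1)^2
     + 2(\mu_1 - \mu)\sum_{i=1}^{n} (x_i - \mu_1)
     + n(\mu_1 - \mu)^2 \\
  &= n\,\sigma_1^2 + 0 + n\,d_1^2
   = n\,(\sigma_1^2 + d_1^2)
\end{aligned}
```

The same steps applied to the second series give m(\sigma_2^2 + d_2^2). Adding the two sums and dividing by n + m yields the combined variance formula.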
Implementation
def mean(arr):
    return sum(arr) / len(arr)


def variance(arr, n):
    # Population variance: mean of squared deviations (divides by n)
    total = 0
    arr_mean = mean(arr)
    for i in range(n):
        total = total + ((arr[i] - arr_mean) * (arr[i] - arr_mean))
    var = total / n
    return var


def combined_statistics(series1, series2):
    n = len(series1)
    m = len(series2)

    # Calculate individual means
    mean1 = mean(series1)
    mean2 = mean(series2)
    print("Mean 1:", round(mean1, 2), " Mean 2:", round(mean2, 2))

    # Calculate individual variances
    var1 = variance(series1, n)
    var2 = variance(series2, m)
    print("Variance 1:", round(var1, 2), " Variance 2:", round(var2, 2))

    # Calculate combined mean
    combined_mean = (n * mean1 + m * mean2) / (n + m)
    print("Combined Mean:", round(combined_mean, 2))

    # Calculate squared differences from combined mean
    d1_square = (mean1 - combined_mean) ** 2
    d2_square = (mean2 - combined_mean) ** 2
    print("d1_square:", round(d1_square, 2), " d2_square:", round(d2_square, 2))

    # Calculate combined variance
    combined_var = (n * (var1 + d1_square) + m * (var2 + d2_square)) / (n + m)
    print("Combined Variance:", round(combined_var, 2))


# Example usage
series1 = [24, 46, 35, 79, 13, 77, 35]
series2 = [66, 68, 35, 24, 46]
combined_statistics(series1, series2)
Output:
Mean 1: 44.14  Mean 2: 47.8
Variance 1: 548.69  Variance 2: 294.56
Combined Mean: 45.67
d1_square: 2.32  d2_square: 4.55
Combined Variance: 446.06
Alternative Using NumPy
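For a simpler implementation, you can use NumPy functions. Note that np.var computes the population variance by default (ddof=0), which is what the combined variance formula above expects: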
import numpy as np


def combined_stats_numpy(series1, series2):
    # Convert to numpy arrays
    arr1 = np.array(series1)
    arr2 = np.array(series2)

    # Individual statistics
    mean1, mean2 = np.mean(arr1), np.mean(arr2)
    var1, var2 = np.var(arr1), np.var(arr2)
    n, m = len(arr1), len(arr2)

    # Combined statistics
    combined_mean = (n * mean1 + m * mean2) / (n + m)
    d1_sq = (mean1 - combined_mean) ** 2
    d2_sq = (mean2 - combined_mean) ** 2
    combined_var = (n * (var1 + d1_sq) + m * (var2 + d2_sq)) / (n + m)

    print(f"Combined Mean: {combined_mean:.2f}")
    print(f"Combined Variance: {combined_var:.2f}")


series1 = [24, 46, 35, 79, 13, 77, 35]
series2 = [66, 68, 35, 24, 46]
combined_stats_numpy(series1, series2)
Output:
Combined Mean: 45.67
Combined Variance: 446.06
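Both implementations above work with the population variance (dividing by the series size). If the two series are samples and you want the sample variance instead (dividing by size minus one, as np.var does with ddof=1), the weights change slightly. The following is a minimal sketch under that assumption; the function name combined_sample_variance is hypothetical, and it uses only the standard library statistics module so it runs without NumPy.

```python
import statistics


def combined_sample_variance(series1, series2):
    """Sample variance (n - 1 denominator) of the merged data,
    computed from each series' size, mean, and sample variance."""
    n, m = len(series1), len(series2)
    mean1 = statistics.fmean(series1)
    mean2 = statistics.fmean(series2)
    s1_sq = statistics.variance(series1)  # sample variance of series1
    s2_sq = statistics.variance(series2)  # sample variance of series2

    combined_mean = (n * mean1 + m * mean2) / (n + m)

    # Rebuild the total sum of squared deviations around the combined mean:
    # (n - 1) * s^2 recovers each series' own sum of squares, and the
    # n * d^2 terms account for the shift between the means.
    sum_sq = ((n - 1) * s1_sq + n * (mean1 - combined_mean) ** 2
              + (m - 1) * s2_sq + m * (mean2 - combined_mean) ** 2)
    return sum_sq / (n + m - 1)


series1 = [24, 46, 35, 79, 13, 77, 35]
series2 = [66, 68, 35, 24, 46]
print(round(combined_sample_variance(series1, series2), 2))
```

The result should agree with statistics.variance(series1 + series2), up to floating-point rounding.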
Key Points
- The combined mean is a weighted average of the individual means, with the series sizes as weights
- The combined variance accounts for both the individual variances and the differences between the means
- The d₁² and d₂² terms measure how far each individual mean lies from the combined mean
- The result equals the population variance (dividing by the number of values) of the merged dataset; it does not match the sample variance, which divides by one less
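The equivalence with the merged dataset can be checked directly. The sketch below recomputes the combined statistics from summaries and compares them with the values obtained by actually concatenating the two lists. The helper name combine is an assumption for illustration; statistics.pvariance is the standard library's population variance.

```python
import math
import statistics


def combine(n, mean1, var1, m, mean2, var2):
    """Combine two (size, mean, population variance) summaries."""
    combined_mean = (n * mean1 + m * mean2) / (n + m)
    d1_sq = (mean1 - combined_mean) ** 2
    d2_sq = (mean2 - combined_mean) ** 2
    combined_var = (n * (var1 + d1_sq) + m * (var2 + d2_sq)) / (n + m)
    return combined_mean, combined_var


series1 = [24, 46, 35, 79, 13, 77, 35]
series2 = [66, 68, 35, 24, 46]

# Statistics from the formula, using only per-series summaries
formula_mean, formula_var = combine(
    len(series1), statistics.fmean(series1), statistics.pvariance(series1),
    len(series2), statistics.fmean(series2), statistics.pvariance(series2),
)

# Statistics computed directly on the merged data
merged = series1 + series2
direct_mean = statistics.fmean(merged)
direct_var = statistics.pvariance(merged)

# Compare with a tolerance, since floating-point results may differ slightly
print(math.isclose(formula_mean, direct_mean))  # True
print(math.isclose(formula_var, direct_var))    # True
```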
Conclusion
The weighted formulas give the mean and population variance of the merged data from each series' size, mean, and variance alone. The raw values never have to be merged or held in memory together, which makes the approach memory-efficient for large datasets.
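To illustrate the memory point, the same formula can be applied repeatedly so that only a small (count, mean, variance) summary is kept per chunk. This is a sketch, not a definitive implementation: the names summarize and merge and the way the example data is split into chunks are assumptions made for illustration.

```python
from functools import reduce
import statistics


def summarize(chunk):
    """Reduce a chunk of values to (count, mean, population variance)."""
    return len(chunk), statistics.fmean(chunk), statistics.pvariance(chunk)


def merge(a, b):
    """Merge two summaries with the combined mean and variance formulas."""
    n, mean1, var1 = a
    m, mean2, var2 = b
    total = n + m
    mean = (n * mean1 + m * mean2) / total
    var = (n * (var1 + (mean1 - mean) ** 2)
           + m * (var2 + (mean2 - mean) ** 2)) / total
    return total, mean, var


# The article's twelve values, split into three chunks for illustration
chunks = [[24, 46, 35], [79, 13, 77, 35], [66, 68, 35, 24, 46]]

# Only one running summary is held at a time, never the merged data
count, mean, var = reduce(merge, (summarize(c) for c in chunks))
print(count, round(mean, 2), round(var, 2))  # 12 45.67 446.06
```

Because merging summaries gives the same result regardless of how the data is grouped (up to floating-point rounding), the chunks can be processed one at a time or in any order.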
---