Statistics for Data Science – Measures of Dispersion
Understand how to measure data spread/variability: Range, Variance, Standard Deviation, IQR, and their applications in risk analysis, model stability, and volatility detection.
🔥 Why Dispersion Matters?
🎯 Core Purpose
Dispersion measures how far data values are spread around a measure of central tendency (mean or median). It describes the variability, or consistency, of the data.
📊 Same Mean, Different Spread
Low Dispersion
Data tightly clustered
Mean = 50, SD = 5
Medium Dispersion
Moderate spread
Mean = 50, SD = 15
High Dispersion
Widely spread data
Mean = 50, SD = 30
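The three cases above can be reproduced with a small sketch. The datasets here are made up purely for illustration, chosen so that each has a mean of 50 and a population SD of 5, 15, and 30 respectively. It uses only Python's standard `statistics` module.

```python
from statistics import mean, pstdev

# Hypothetical datasets: identical mean, increasingly wide spread.
low = [45, 55, 45, 55]
medium = [35, 65, 35, 65]
high = [20, 80, 20, 80]

for name, data in [("low", low), ("medium", medium), ("high", high)]:
    # pstdev is the population standard deviation (divides by N).
    print(f"{name:>6}: mean = {mean(data)}, SD = {pstdev(data)}")
```

The mean alone cannot tell these datasets apart; the SD can.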
📚 Types of Dispersion Measures
Absolute Measures
Expressed in the data's own units (squared units for variance)
(Range, Variance, SD, MD)
Relative Measures
Expressed as ratios/percentages
(Coefficient of Variation)
Positional Measures
Based on data positions
(Quartiles, IQR, Percentiles)
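As a rough illustration of a relative measure, the coefficient of variation (CV) divides the SD by the mean, usually expressed as a percentage. The helper name `coefficient_of_variation` and the weights below are assumptions made for this sketch, and the measure is only meaningful when the mean is not zero.

```python
from statistics import mean, pstdev

def coefficient_of_variation(data):
    """Population SD as a percentage of the mean (mean must be non-zero)."""
    return pstdev(data) / mean(data) * 100

# Hypothetical weights: the same measurements in kilograms and in grams.
weights_kg = [45, 55, 45, 55]
weights_g = [w * 1000 for w in weights_kg]

cv_kg = coefficient_of_variation(weights_kg)
cv_g = coefficient_of_variation(weights_g)

# The SD changes with the unit, but the CV does not.
print(pstdev(weights_kg), pstdev(weights_g))
print(cv_kg, cv_g)
```

Because the CV is unit-free, it can compare the spread of variables measured on different scales.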
📋 7 Key Dispersion Concepts
- Range: simplest measure
- Variance: average squared deviation
- Standard Deviation: most widely used
- Mean Deviation: average absolute deviation
- IQR: middle 50% range
- Importance: risk & consistency
- Visual comparison
📏 Range
Formula
Range = Max − Min
📊 Range Visualization
✅ Advantages
- Simple to calculate
- Easy to understand
- Quick measure of spread
❌ Limitations
- Sensitive to outliers
- Ignores data distribution
- Based only on two values
Range: 85 - 15 = 70 units
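A minimal sketch of the range and its sensitivity to outliers. The scores are hypothetical, chosen only so that the extremes match the 85 and 15 in the example above; `value_range` is an illustrative helper name.

```python
def value_range(data):
    """Range = maximum value minus minimum value."""
    return max(data) - min(data)

# Hypothetical scores with a minimum of 15 and a maximum of 85.
scores = [15, 40, 52, 60, 85]
print(value_range(scores))               # 70

# One extreme value changes the range, although the other values are untouched.
scores_with_outlier = scores + [300]
print(value_range(scores_with_outlier))  # 285
```

Only the two extreme values enter the calculation, which is why a single outlier can dominate the result.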
📊 Variance & Standard Deviation
Variance
Formula
σ² = Σ(x − μ)² / N, where μ is the mean and N is the number of values
Average of squared deviations from the mean. Larger variance = more spread.
Standard Deviation
Formula
σ = √( Σ(x − μ)² / N )
Square root of variance. In original units. Most commonly used dispersion measure.
🧮 Step-by-Step Calculation
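One way to walk through the calculation is sketched below. The eight values are hypothetical and were picked so that every intermediate result is a round number; the steps compute the population variance (dividing by N) and then its square root.

```python
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values

# Step 1: compute the mean.
mu = sum(data) / len(data)               # 40 / 8 = 5.0

# Step 2: subtract the mean from each value.
deviations = [x - mu for x in data]      # -3, -1, -1, -1, 0, 0, 2, 4

# Step 3: square each deviation.
squared = [d ** 2 for d in deviations]   # 9, 1, 1, 1, 0, 0, 4, 16

# Step 4: average the squared deviations (population variance).
variance = sum(squared) / len(data)      # 32 / 8 = 4.0

# Step 5: take the square root to return to the original units.
sd = sqrt(variance)                      # 2.0

print(mu, variance, sd)
```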
📋 Key Differences
| Aspect | Variance (σ²) | Standard Deviation (σ) |
|---|---|---|
| Units | Squared units | Original units |
| Interpretation | Harder to interpret | Easy to understand |
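| Use in ML | Cross-validation variance, variance reduction | Standardization, z-score outlier detection |
| Sensitivity | Squaring amplifies the effect of outliers | Same outliers, but expressed on the original scale (√ effect) |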
📉 Mean Deviation
Definition
Mean Deviation (or Mean Absolute Deviation) is the average of absolute deviations from the mean or median.
Formula
MD = Σ|x − μ| / N
where |x − μ| is the absolute value of each deviation from the mean
💡 Why Use Absolute Values?
Using absolute values prevents positive and negative deviations from canceling each other out.
Without abs: (-5 + 5)/2 = 0 (wrong!)
With abs: (5 + 5)/2 = 5 (correct!)
📊 Comparison with Variance
- MD: Uses absolute values
- Variance: Uses squared values
- MD: Less sensitive to outliers
- SD: More mathematically convenient
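The difference in outlier sensitivity can be illustrated with a small experiment. This is only a sketch on made-up data: it replaces one value with an outlier and compares how much each measure grows. `mean_deviation` is a helper written for this example, not a standard-library function.

```python
from statistics import mean, pstdev

def mean_deviation(data):
    """Average absolute deviation from the mean."""
    mu = mean(data)
    return sum(abs(x - mu) for x in data) / len(data)

# Hypothetical data: the second list replaces the largest value with an outlier.
clean = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
with_outlier = [10, 11, 12, 13, 14, 15, 16, 17, 18, 119]

# Growth factor of each measure once the outlier is introduced.
md_growth = mean_deviation(with_outlier) / mean_deviation(clean)
sd_growth = pstdev(with_outlier) / pstdev(clean)

print(f"MD grows by a factor of {md_growth:.2f}")
print(f"SD grows by a factor of {sd_growth:.2f}")
```

For this data the SD grows by a larger factor than the MD, because squaring gives the single extreme deviation more weight.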
🏢 Business Application
Sales Forecasting Error: Mean Deviation measures average forecasting error magnitude without considering direction.
Example: Actual vs Forecasted sales errors: [-200, +300, -100, +400]
MD: (200+300+100+400)/4 = 250 units average error
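The forecasting example can be checked in a few lines. The sketch below uses the four errors listed above and contrasts the plain (signed) average with the average of absolute values.

```python
# Forecast errors from the example above (actual minus forecast, in units).
errors = [-200, 300, -100, 400]

# Signed average: positive and negative errors partly cancel.
signed_mean = sum(errors) / len(errors)                      # 100.0

# Average magnitude: direction is ignored, so nothing cancels.
mean_abs_error = sum(abs(e) for e in errors) / len(errors)   # 250.0

print(signed_mean, mean_abs_error)
```

The signed average of 100 understates how far off a typical forecast is; the average magnitude of 250 reflects it.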
📊 IQR & Quartiles
Interquartile Range
Range of the middle 50% of data
Formulas
IQR = Q3 − Q1
Q1 = first quartile (25th percentile), Q3 = third quartile (75th percentile)
📦 Box Plot (Box & Whisker) Visualization
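The five numbers a box plot draws (minimum, Q1, median, Q3, maximum) can be computed as in the sketch below, with the IQR taken as Q3 minus Q1. The data is hypothetical. Note that several quartile conventions exist and can give slightly different values; this example assumes the `"inclusive"` method of `statistics.quantiles` (Python 3.8+).

```python
from statistics import quantiles

data = [2, 4, 6, 8, 10, 12, 14, 16, 18]   # hypothetical values

# n=4 splits the data into quarters and returns the three cut points.
q1, median, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("min    =", min(data))   # 2
print("Q1     =", q1)          # 6.0
print("median =", median)      # 10.0
print("Q3     =", q3)          # 14.0
print("max    =", max(data))   # 18
print("IQR    =", iqr)         # 8.0
```

The box spans Q1 to Q3, so its length is the IQR; the whiskers extend toward the minimum and maximum.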
🚀 Data Science Applications
Risk Analysis
High variance in financial returns indicates higher investment risk.
Model Stability
Low variance in model predictions indicates consistent performance.
Volatility Detection
Standard deviation measures price volatility in stock markets.
Quality Control
Low process variance indicates consistent manufacturing quality.
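As a small illustration of the risk and volatility use cases, the sketch below compares two made-up series of daily returns (in percent) that share the same average but differ in spread. The numbers are hypothetical and carry no financial meaning.

```python
from statistics import mean, stdev

# Hypothetical daily returns in percent; both series average 0.5%.
stable = [0.5, 0.4, 0.6, 0.5, 0.5]
volatile = [3.0, -2.0, 4.0, -3.5, 1.0]

for name, returns in [("stable", stable), ("volatile", volatile)]:
    # stdev is the sample standard deviation (divides by n - 1).
    print(f"{name:>8}: mean = {mean(returns):.2f}%, SD = {stdev(returns):.2f}%")
```

With equal average returns, the series with the larger SD is the more volatile, and in this sense the riskier, of the two.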
🤖 Machine Learning Use Cases
- Feature scaling: standardization uses SD
- Outlier detection: IQR & z-score methods
- Model evaluation: cross-validation variance
- Variance reduction
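Two of these uses, standardization and outlier detection, can be sketched with the standard library alone. The data, the z-score cutoff of 2, and the 1.5 × IQR fences are assumptions made for illustration; both thresholds are common conventions rather than fixed rules.

```python
from statistics import mean, pstdev, quantiles

# Hypothetical measurements; 60 is planted as an obvious outlier.
values = [10, 12, 11, 13, 12, 11, 14, 12, 13, 60]

# Standardization: subtract the mean and divide by the SD.
mu, sigma = mean(values), pstdev(values)
z_scores = [(x - mu) / sigma for x in values]

# Z-score method: flag values more than 2 SDs from the mean (assumed cutoff).
z_outliers = [x for x, z in zip(values, z_scores) if abs(z) > 2]

# IQR method: flag values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR.
q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [x for x in values if x < lower or x > upper]

print(z_outliers, iqr_outliers)
```

After standardization the z-scores have mean 0 and SD 1. In small samples the outlier itself inflates the SD, which is one reason the IQR method is often preferred for skewed data.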
🏢 Real-World Scenario: E-commerce Delivery Times
Business Insight: Although the average delivery time is 2.5 days, the SD of 0.8 days shows that individual deliveries vary noticeably around that average. The IQR shows that the middle 50% of deliveries take 1.9–3.1 days.
📐 Key Dispersion Formulas
Range
Range = Max − Min
Where:
Max = Maximum value
Min = Minimum value
Variance
Population variance: σ² = Σ(x − μ)² / N
Sample variance: s² = Σ(x − x̄)² / (n−1)
Standard Deviation
σ = √σ²
Square root of variance
Returns to original units
Interquartile Range
IQR = Q3 − Q1
Where:
Q1 = First quartile (25%)
Q3 = Third quartile (75%)
💡 Pro Tip: For sample data, use (n-1) in denominator for unbiased variance estimation
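The effect of the (n−1) denominator can be seen by computing both versions on the same values. The data below is hypothetical; `pvariance` and `variance` are the population and sample variance functions from Python's `statistics` module.

```python
from statistics import pvariance, variance

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values, mean = 5

# Sum of squared deviations from the mean is 32 for this data.
pop_var = pvariance(sample)    # divides by N = 8      -> 4
samp_var = variance(sample)    # divides by n - 1 = 7  -> about 4.57

print(pop_var, samp_var)
```

The sample version is always slightly larger, and the gap shrinks as n grows.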
✅ Chapter Summary
Core Purpose
Measure data spread/variability beyond central tendency.
7 Key Concepts
Range, Variance, SD, Mean Deviation, IQR, Importance, Visual comparison.
5 Key Formulas
Range, Variance, SD, IQR, Mean Deviation formulas.
Data Science Use
Risk analysis, model stability, volatility detection.