DSPython Logo DSPython
Statistics for Data Science

ANOVA

Analysis of Variance — Comparing more than two groups at once 📊⚖️

1️⃣ What is ANOVA?

ANOVA (Analysis of Variance) is a statistical test used to compare the means of three or more groups to see if at least one of them is significantly different from the others.

🤔 Why not just use multiple t-tests?

If you compare Group A vs B, B vs C, and A vs C separately, your risk of making a Type I Error (False Positive) increases drastically. ANOVA does it all in one go with a single alpha level (e.g., 0.05).

2️⃣ The Logic: F-Statistic

ANOVA compares two types of variance. If the "Between" variance is much larger than the "Within" variance, the groups are likely different.

Variance Between Groups

How different are the group means from each other?

↔️

Variance Within Groups

How spread out is the data inside each group (noise)?

📉
F = Variance Between Groups / Variance Within Groups

Large F-value → High probability that groups are different.

3️⃣ Visualizing Variance

Low F-score (Groups Overlap)

A B C

Means are close, noise is high. Fail to Reject H₀.

High F-score (Distinct Groups)

A B C

Means are far apart. Reject H₀ (Significant).

🥗

4️⃣ Case Study: Which Diet Works?

Comparing weight loss (kg) across 3 different diets.

The Data (Weight Loss in kg)

Diet A (Keto): [4, 5, 6] Mean = 5
Diet B (Vegan): [10, 11, 12] Mean = 11
Diet C (Low Carb): [16, 17, 18] Mean = 17

Grand Mean (Overall) = 11

Maths Breakdown 🧮

1. Variance Between (SSB)

How far are group means (5, 11, 17) from Grand Mean (11)?
Sum = 3*((5-11)² + (11-11)² + (17-11)²) = 3*(36+0+36) = 216

2. Variance Within (SSW)

Variation inside groups (e.g., 4,5,6 -> variance is small).
Sum of squared diffs inside groups = 2 + 2 + 2 = 6

3. F-Statistic

MS_Between = 216 / 2 = 108
MS_Within = 6 / 6 = 1
F = 108 / 1 = 108.0

Python Solution scipy.stats
import scipy.stats as stats

# 1. The Data
diet_a = [4, 5, 6]
diet_b = [10, 11, 12]
diet_c = [16, 17, 18]

# 2. Perform One-Way ANOVA
f_stat, p_val = stats.f_oneway(diet_a, diet_b, diet_c)

print(f"F-statistic: {f_stat:.2f}")
print(f"P-value: {p_val:.10f}")
Output
> F-statistic: 108.00
> P-value: 0.0000001

Conclusion: p < 0.05. We reject H₀. There is a significant difference between the diets.

🔎 What happens after you reject H₀?

ANOVA only tells you that there is a difference, not where it is (e.g., is A different from B, or B from C?).
To find out, we use Post-Hoc Tests like Tukey's HSD (Honest Significant Difference).

Interview Checkpoint 🎯

1. When should I use ANOVA instead of t-test?
Use ANOVA when you have a categorical independent variable with 3 or more levels (groups) and a quantitative dependent variable.
2. What are the assumptions of ANOVA?
1. Normality: Data in each group is normally distributed.
2. Homogeneity of Variance: Variances in groups are roughly equal.
3. Independence: Observations are independent of each other.
3. What does a high F-value mean?
It means the variation between the groups is much larger than the variation within the groups. This suggests the groups are truly different.
🤖
DSPython AI Assistant
👋 Hi! I’m your AI assistant. Paste your code here, I will find bugs for you.