DSPython

Statistics for Data Science – Normal Distribution

Understand the MOST IMPORTANT distribution in data science: Bell curve properties, empirical rule, Z-scores, and applications in feature scaling, outlier detection, and confidence intervals.

Statistics Core Intermediate → Advanced 55 min

🔥 The MOST IMPORTANT Distribution in Data Science

🎯 Why It's So Important

The Normal Distribution (Gaussian Distribution) is the foundation of statistical inference and machine learning. It shows up widely in nature, business, and science, largely because of the Central Limit Theorem: sums and averages of many independent effects tend toward a normal shape.

📊

Universal

Approximates many natural phenomena

⚖️

Mathematical Simplicity

Easy to work with analytically

🎯

Central Limit Theorem

Sample means become approximately normal as the sample size grows

🤖

ML Foundation

Basis for many algorithms
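
The Central Limit Theorem card above can be checked with a small simulation. The sketch below is only an illustration under assumed settings: the exponential source distribution, the sample size of 50, and the 5,000 repetitions are all arbitrary choices, not values from this chapter.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential distribution (strongly right-skewed)."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# 5,000 sample means, each computed from 50 skewed observations
means = [sample_mean(50) for _ in range(5000)]

center = statistics.fmean(means)  # expected near the population mean, 1.0
spread = statistics.stdev(means)  # expected near 1 / sqrt(50), about 0.141

# Share of sample means within one standard deviation of their center;
# for a normal distribution this share is about 68%
within_1sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"center={center:.3f}, spread={spread:.3f}, within 1 SD={within_1sd:.1%}")
```

Even though the individual observations are far from bell-shaped, the sample means should cluster symmetrically around 1.0, with roughly 68% of them within one standard deviation.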

📊 Where You'll Find Normal Distributions

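  • Human Height: most people are near the average height
  • Test Scores: most scores fall around the class average
  • Measurement Errors: small errors are more common than large ones
  • Stock Returns: daily returns cluster around the mean (only roughly normal; extreme moves are more common than the bell curve predicts)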
Data Science Insight: If you understand the Normal Distribution well, you understand half of statistics. It's the foundation for hypothesis testing, confidence intervals, and many machine learning algorithms.

📚 8 Core Concepts

1️⃣

Bell Curve

Characteristic shape

2️⃣

Mean = Median = Mode

Central tendency equality

3️⃣

Symmetry

Perfect mirror image

4️⃣

Empirical Rule

68–95–99.7 rule

5️⃣

Z-score

Standardization

6️⃣

Standard Normal

μ=0, σ=1

7️⃣

Skewness Intro

Asymmetry measure

8️⃣

Real Data Examples

Practical applications

📐 3 Key Formulas

Z = (x − μ) / σ
Z-score Formula
📏
Standardization
🎯
Normalization

📌 Example

Class average = 60, σ = 10

Student score = 80

Z = (80 − 60) / 10 = +2

👉 Student scored 2 standard deviations above average
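
The same calculation can be written as a small helper. This is a minimal sketch; the function name `z_score` is a hypothetical choice for illustration.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Class average = 60, sigma = 10, student score = 80
z = z_score(80, 60, 10)
print(z)  # 2.0
```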

📊 Bell Curve & Properties

📈

The Bell Curve

Also known as Gaussian curve or Normal curve

[Figure: bell curve centered at μ, with μ − σ and μ + σ marked on the horizontal axis]

📌 Example: Exam Marks

Suppose marks of 100 students form a normal distribution.

  • Most students score around 70 marks
  • Very few students score below 40 or above 95

👉 This creates a bell-shaped curve with peak at average marks.

⚖️

Mean = Median = Mode

In a perfectly normal distribution, all three measures of central tendency are identical.

Example: If mean height = 170cm, then median = 170cm, and mode = 170cm
🔄

Perfect Symmetry

The left half is a mirror image of the right half around the mean.

Implication: 50% of data is below mean, 50% above mean

📌 Example

Heights (cm): 165, 168, 170, 170, 172

  • Mean = (165+168+170+170+172)/5 = 169
  • Median = 170
  • Mode = 170

👉 In near-normal data, these values almost match.
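
The three measures for the heights above can be reproduced with Python's built-in `statistics` module, as a quick check of the arithmetic:

```python
import statistics

heights_cm = [165, 168, 170, 170, 172]

mean = statistics.mean(heights_cm)      # 845 / 5
median = statistics.median(heights_cm)  # middle value of the sorted list
mode = statistics.mode(heights_cm)      # most frequent value

print(mean, median, mode)  # 169 170 170
```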

📏 Effect of Standard Deviation

  • Small σ (σ=5): tall and narrow
  • Medium σ (σ=10): typical bell
  • Large σ (σ=20): short and wide
Key Insight: Standard deviation controls the spread. Smaller σ = data clustered near mean. Larger σ = data spread out.
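
To see the effect numerically, the sketch below compares the three σ values using `statistics.NormalDist`. The mean of 70 and the ±5 window are assumptions made only for this illustration.

```python
from statistics import NormalDist

mu = 70  # hypothetical mean, chosen only for illustration

peaks = {}
for sigma in (5, 10, 20):
    dist = NormalDist(mu, sigma)
    peaks[sigma] = dist.pdf(mu)  # curve height at the mean: 1 / (sigma * sqrt(2 * pi))
    share = dist.cdf(mu + 5) - dist.cdf(mu - 5)  # probability within 5 units of the mean
    print(f"sigma={sigma:>2}: peak height={peaks[sigma]:.4f}, within 5 of mean={share:.1%}")
```

Doubling σ halves the peak height, and a smaller share of the data stays close to the mean.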

🎯 Empirical Rule (68–95–99.7)

68-95-99.7

The Golden Rule

For any normal distribution, data falls within these predictable ranges

[Figure: normal curve with nested shaded regions for 68% (μ ± σ), 95% (μ ± 2σ), and 99.7% (μ ± 3σ)]
  • 68%: within 1 standard deviation (μ ± σ)
  • 95%: within 2 standard deviations (μ ± 2σ)
  • 99.7%: within 3 standard deviations (μ ± 3σ)
Practical Application: If test scores are normally distributed with μ=75 and σ=10, then:
  • 68% of students scored between 65 and 85
  • 95% of students scored between 55 and 95
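  • 99.7% of students scored between 45 and 105 (if marks are capped at 100, this upper bound is a reminder that the normal model is only an approximation)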
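
The percentages in the empirical rule are rounded. A short check with `statistics.NormalDist`, using the same μ=75 and σ=10, shows the more precise values; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

scores = NormalDist(mu=75, sigma=10)

def share_within(dist, k):
    """Probability that a value falls within k standard deviations of the mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)

for k in (1, 2, 3):
    low, high = 75 - 10 * k, 75 + 10 * k
    print(f"{low} to {high}: {share_within(scores, k):.2%}")
```

The three shares come out at roughly 68.27%, 95.45%, and 99.73%, which the rule rounds to 68, 95, and 99.7.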

📐 Z-Scores & Standard Normal Distribution

Z

Z-Score Formula

Measures how many standard deviations a value is from the mean

Z = (x − μ) / σ
x = individual value
μ = population mean
σ = population standard deviation

📌 Example

If original marks are converted to Z-scores:

  • Mean becomes 0
  • Standard deviation becomes 1

👉 Now we can directly use Z-tables.
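
As a rough sketch of that conversion, the code below standardizes a small list of marks. The marks themselves are hypothetical and exist only to make the example runnable.

```python
import statistics

marks = [52, 61, 68, 70, 74, 79, 86]  # hypothetical marks, for illustration only

mu = statistics.fmean(marks)
sigma = statistics.pstdev(marks)  # population standard deviation, matching σ in the formula

z_scores = [(x - mu) / sigma for x in marks]

new_mean = statistics.fmean(z_scores)
new_sd = statistics.pstdev(z_scores)
print(f"mean of Z-scores is zero: {abs(new_mean) < 1e-9}")
print(f"SD of Z-scores is one: {abs(new_sd - 1) < 1e-9}")
```

Standardizing shifts and rescales the data but does not change its shape, so Z-table lookups are only accurate when the original marks are themselves close to normal.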

📊 Z-Score Interpretation

  • Z = 0: value is exactly at the mean
  • Z = +1.5: value is 1.5σ above the mean
  • Z = −2.0: value is 2σ below the mean
  • |Z| > 3: potential outlier (rare)

🎯 Standard Normal Distribution

The Standard Normal Distribution is a special normal distribution with:

μ = 0
Mean
σ = 1
Standard Deviation

Key Benefit: Any normal distribution can be converted to standard normal using Z-scores. This allows us to use standard normal tables (Z-tables).
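
In code, a Z-table lookup is a call to the standard normal CDF. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `percentile_from_score` and the example values (score 85 with μ=75, σ=10) are chosen for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # defaults to mu=0, sigma=1

def percentile_from_score(x, mu, sigma):
    """Convert a raw value to a Z-score, then to the share of values below it."""
    z = (x - mu) / sigma
    return z, standard_normal.cdf(z)

z, below = percentile_from_score(85, mu=75, sigma=10)
print(f"Z = {z:+.1f}, share of scores below = {below:.2%}")
```

A score one standard deviation above the mean sits at about the 84th percentile, which matches the value a printed Z-table gives for Z = 1.00.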

🧮 Z-Score Calculation Example

  • Scenario: test scores with μ=75, σ=10
  • Student A: score = 85
  • Calculation: Z = (85−75)/10
  • Result: Z = +1.0
  • Interpretation: 1σ above the mean
Another Example: Score = 60 → Z = (60−75)/10 = −1.5 (1.5σ below the mean)
Data Science Insight: Z-scores are fundamental for feature scaling in machine learning. Many algorithms (like SVM, K-means, PCA) perform better when features are standardized to have μ=0, σ=1.

📉 Skewness Introduction

⚖️

What is Skewness?

Measure of asymmetry in a distribution

Perfectly Normal: Skewness = 0 (symmetric)

Positive Skew: Right tail longer (typically mean > median)

Negative Skew: Left tail longer (typically mean < median)

📌 Example: Zero Skew

If average salary = ₹40,000 and the distribution is symmetric:

  • 50% of employees earn below ₹40,000
  • 50% of employees earn above ₹40,000

👉 The mean equals the median, so the distribution is balanced around the mean.

[Figure: three curves labeled Positive Skew (+), Zero Skew, and Negative Skew (−)]

[Figure: three curves labeled Platykurtic, Mesokurtic, and Leptokurtic]

📌 Example: Income

Most people earn around ₹30,000

Few people earn ₹5,00,000+

👉 Right tail becomes longer → Positive Skew
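
Skewness can be computed as the average cubed Z-score. The sketch below is an illustration under stated assumptions: the incomes are simulated from a log-normal distribution with a median of about ₹30,000, and both the distribution and its parameters are hypothetical.

```python
import random
import statistics

def skewness(data):
    """Population skewness: the average cubed Z-score."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return statistics.fmean(((x - mu) / sigma) ** 3 for x in data)

random.seed(0)
# Hypothetical incomes drawn from a log-normal distribution (median about 30,000)
incomes = [30000 * random.lognormvariate(0, 0.8) for _ in range(10000)]

print(f"mean   = {statistics.fmean(incomes):,.0f}")
print(f"median = {statistics.median(incomes):,.0f}")
print(f"skew   = {skewness(incomes):.2f}")
```

The few very large incomes pull the mean above the median and give a clearly positive skewness.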

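📌 Example: Kurtosis

Kurtosis describes how heavy the tails are compared with a normal curve:

  • Platykurtic: flatter peak and lighter tails, e.g. marks spread fairly evenly across the range with no extreme scores
  • Mesokurtic: tails like a normal curve, e.g. a typical paper giving a standard bell curve
  • Leptokurtic: sharper peak and heavier tails, e.g. most marks packed near the mean plus a few extreme scores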
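
The three kurtosis types can be told apart numerically with excess kurtosis (the average fourth-power Z-score minus 3). The sketch below uses simulated data; the choice of a uniform, a normal, and a Laplace distribution as stand-ins for the three types is an assumption made for illustration.

```python
import random
import statistics

def excess_kurtosis(data):
    """Average fourth-power Z-score minus 3, so a normal distribution scores about 0."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return statistics.fmean(((x - mu) / sigma) ** 4 for x in data) - 3

random.seed(1)
n = 50000
samples = {
    "uniform (platykurtic)": [random.uniform(0, 100) for _ in range(n)],
    "normal (mesokurtic)": [random.gauss(50, 10) for _ in range(n)],
    # The difference of two exponentials follows a Laplace distribution:
    # sharp peak and heavy tails
    "Laplace (leptokurtic)": [random.expovariate(1) - random.expovariate(1) for _ in range(n)],
}

results = {name: excess_kurtosis(data) for name, data in samples.items()}
for name, value in results.items():
    print(f"{name}: excess kurtosis = {value:+.2f}")
```

The theoretical values are −1.2 for the uniform, 0 for the normal, and +3 for the Laplace distribution, so the simulated results should land near those figures.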

🏢 Real-World Skewed Distributions

  • Income Distribution: positive skew (few very high incomes)
  • House Prices: positive skew (few expensive houses)
  • Age at Death: negative skew (few very young deaths)
  • Exam Scores: often normal or slightly negative skew
Data Transformation Tip: When data is skewed, we often apply transformations (log, square root) to make it more normal before applying statistical tests or ML algorithms that assume normality.
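
A minimal sketch of the log transformation in action. The house prices here are simulated from a log-normal distribution, which is a hypothetical choice that makes the effect easy to see; real data will usually respond less cleanly.

```python
import math
import random
import statistics

def skewness(data):
    """Population skewness: the average cubed Z-score."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return statistics.fmean(((x - mu) / sigma) ** 3 for x in data)

random.seed(7)
# Hypothetical right-skewed house prices
house_prices = [random.lognormvariate(13, 0.6) for _ in range(10000)]
log_prices = [math.log(p) for p in house_prices]

skew_before = skewness(house_prices)
skew_after = skewness(log_prices)
print(f"skewness before log: {skew_before:.2f}")
print(f"skewness after log:  {skew_after:.2f}")
```

The log transform compresses the long right tail, so the skewness drops from clearly positive to near zero. It only applies to strictly positive values.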

🧠 Data Science Applications

⚖️

Feature Scaling

Standardizing features to μ=0, σ=1 using Z-scores.

Used in: SVM, K-means, PCA, Neural Networks
🎯

Outlier Detection

Using Z-scores to identify unusual values (|Z| > 3).

Example: Fraud detection, anomaly detection
📊

Confidence Intervals

Constructing intervals using normal distribution properties.

Example: 95% CI = mean ± 1.96×SE
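
The interval formula above can be sketched as follows. The sample is simulated (hypothetical heights), and the function name `normal_confidence_interval` is an illustrative choice. For small samples a t-distribution critical value is normally used instead of 1.96.

```python
import math
import random
import statistics
from statistics import NormalDist

def normal_confidence_interval(data, confidence=0.95):
    """Mean ± z × SE, using the normal critical value (reasonable for larger samples)."""
    n = len(data)
    mean = statistics.fmean(data)
    se = statistics.stdev(data) / math.sqrt(n)      # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return mean - z * se, mean + z * se

random.seed(3)
sample = [random.gauss(170, 8) for _ in range(40)]  # hypothetical heights in cm

low, high = normal_confidence_interval(sample)
print(f"95% CI for the mean: {low:.1f} to {high:.1f}")
```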

🤖 Machine Learning Algorithms Using Normal Distribution

  • Linear Regression: classical inference (p-values, confidence intervals) assumes normally distributed errors
  • Gaussian Naive Bayes: assumes each feature is normally distributed within each class
  • Gaussian Processes: use multivariate normal distributions
  • Anomaly Detection: based on deviation from normal patterns

🏢 Real-World Example: Feature Scaling for ML

  • Feature 1: Age (μ=35, σ=10)
  • Feature 2: Income (μ=50000, σ=20000)
  • Problem: different scales bias models
  • Solution: Z-score standardization

After Standardization: Both features have μ=0, σ=1. This prevents income from dominating age in distance-based algorithms like K-means or SVM.
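
The effect on distances can be sketched with a few made-up records. The four customers below are hypothetical, and the `standardize` helper is an illustrative stand-in for what a library scaler such as scikit-learn's `StandardScaler` does.

```python
import math
import statistics

# Hypothetical customers: (age in years, income)
customers = [(25, 30000), (35, 50000), (45, 70000), (30, 90000)]

def standardize(column):
    """Convert a column of values to Z-scores using the population SD."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)
    return [(x - mu) / sigma for x in column]

ages, incomes = zip(*customers)
scaled = list(zip(standardize(ages), standardize(incomes)))

# Euclidean distance between the first two customers, before and after scaling
raw_dist = math.dist(customers[0], customers[1])
scaled_dist = math.dist(scaled[0], scaled[1])

print(f"raw distance:    {raw_dist:,.1f}")   # driven almost entirely by income
print(f"scaled distance: {scaled_dist:.2f}")  # age and income both contribute
```

Before scaling, the 10-year age gap is invisible next to the 20,000 income gap. After scaling, both features are measured in standard deviations and contribute on comparable terms.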

⚠️ Outlier Detection Using Z-scores

Scenario: Credit card transaction amounts normally distributed with μ=$50, σ=$15

  • Transaction: $200
  • Z-score: Z = (200−50)/15 = 10
  • Interpretation: 10σ above the mean
  • Action: flag for fraud review

Rule: Typically flag transactions with |Z| > 3 as potential outliers (beyond 99.7% of normal transactions).
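
A minimal sketch of this rule, assuming μ and σ are already known from historical data. The function name `flag_outliers` and the transaction amounts are hypothetical.

```python
def flag_outliers(amounts, mu, sigma, threshold=3.0):
    """Return (amount, z) pairs whose absolute Z-score exceeds the threshold."""
    flagged = []
    for amount in amounts:
        z = (amount - mu) / sigma
        if abs(z) > threshold:
            flagged.append((amount, z))
    return flagged

transactions = [42, 55, 61, 200, 38, 90]  # hypothetical amounts in dollars

flagged = flag_outliers(transactions, mu=50, sigma=15)
for amount, z in flagged:
    print(f"${amount}: Z = {z:.1f} -> flag for review")
```

If μ and σ are instead estimated from the same data being screened, a large outlier inflates σ and can hide itself, so it is safer to estimate them from a clean reference period.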

📐 Key Formulas

Z

Z-Score Formula

Z = (x − μ) / σ

Where:
x = individual value
μ = population mean
σ = population standard deviation

Interpretation: Z = 1.5 means value is 1.5 standard deviations above mean
68-95-99.7

Empirical Rule

P(μ − σ ≤ X ≤ μ + σ) ≈ 0.68
P(μ − 2σ ≤ X ≤ μ + 2σ) ≈ 0.95
P(μ − 3σ ≤ X ≤ μ + 3σ) ≈ 0.997

For any normal distribution:
68% within ±1σ, 95% within ±2σ, 99.7% within ±3σ

N(0,1)

Standard Normal

X ~ N(μ, σ²)
Z = (X − μ)/σ ~ N(0, 1)

Any normal distribution X can be standardized to Z
Z follows standard normal distribution
μ=0, σ=1

💡 Pro Tip: Memorize these critical Z-values for two-sided intervals: Z=1.96 gives 95% confidence and Z=2.576 gives 99% confidence. The empirical rule's "95% within ±2σ" rounds 1.96 up to 2. These values are used constantly in hypothesis testing.
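
These critical values do not have to be taken on faith: they come from the inverse CDF of the standard normal. A short sketch, with the hypothetical helper name `two_sided_critical_z`:

```python
from statistics import NormalDist

def two_sided_critical_z(confidence):
    """Z-value that leaves (1 - confidence) / 2 in each tail of the standard normal."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: Z = {two_sided_critical_z(level):.3f}")
```

This gives about 1.645, 1.960, and 2.576 for the 90%, 95%, and 99% levels.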

📌 Example: IQ Scores

IQ scores are scaled to be approximately normally distributed with:

  • Mean (μ) = 100
  • Standard Deviation (σ) = 15

By the empirical rule:

  • About 68% of people have IQ between 85 and 115
  • About 95% of people have IQ between 70 and 130
  • About 99.7% of people have IQ between 55 and 145

👉 Values outside 55 to 145 are extremely rare (about 0.3% of people).

✅ Chapter Summary

🔥

Core Purpose

MOST IMPORTANT distribution in data science.

📚

8 Key Concepts

Bell curve, mean=median=mode, symmetry, empirical rule, Z-score, standard normal, skewness, real examples.

📐

3 Key Formulas

Z = (x − μ) / σ, the empirical rule probabilities (68–95–99.7), and the standard normal Z ~ N(0, 1).

🧠

Data Science Use

Feature scaling, outlier detection, confidence intervals.

📋 Quick Reference Guide

  • Z = (x−μ)/σ
  • 68% within ±1σ
  • 95% within ±2σ
  • 99.7% within ±3σ
  • Standard Normal: μ=0, σ=1
  • Mean = Median = Mode
  • |Z| > 3 → Outlier