DSPython

Sampling & Central Limit Theorem

Master the bridge between samples and populations, from sampling methods to the Central Limit Theorem (CLT) that underpins much of inferential statistics.

Inferential Statistics Foundation | Advanced Level | 50 min

🎯 Population vs Sample

Population

Complete set of all items/individuals of interest.

  • Parameter: True value (μ, σ)
  • Often impractical to measure
  • Example: All voters in country

Sample

Subset selected from population.

  • Statistic: Estimated value (x̄, s)
  • Practical to measure
  • Example: 1000 voters surveyed

📊 Population → Sample → Inference

Population: all elements (parameters: μ, σ)
↓
Sample: selected subset (statistics: x̄, s)
↓
Inference: draw conclusions about the population

🏢 Real World Example: E-commerce

Population: All 1M customers
True conversion rate: p = 3.2%
Sample: 1000 customers surveyed
Sample rate: p̂ = 3.5%
Inference: Estimate true rate
95% CI: [2.4%, 4.6%], from p̂ ± 1.96 × √[p̂(1-p̂)/n]
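
The interval above can be reproduced with a few lines of Python. This is a minimal sketch using the normal-approximation (Wald) interval; the function name `proportion_ci` is hypothetical, and the count of 35 conversions is simply 3.5% of the 1000 sampled customers.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)          # standard error of p_hat
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    return p_hat - z * se, p_hat + z * se

# Assumption: 35 conversions out of 1000 sampled customers (3.5%)
low, high = proportion_ci(35, 1000)
print(f"95% CI: [{low:.1%}, {high:.1%}]")
```

With only 1000 customers the interval is about ±1.1 percentage points wide, so the true rate of 3.2% sits comfortably inside it.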

🎯 Sampling Methods

🎯 Simple Random

Every member has an equal chance of selection

Example: Random number generator
✅ No selection bias
❌ May miss subgroups

📊 Stratified

Divide into strata, sample from each

Example: Age groups 18-25, 26-40, 41+
✅ Represents all groups
❌ Needs prior knowledge of the strata

🏙️ Cluster

Randomly select clusters, sample all in cluster

Example: Random cities, survey all schools
✅ Cost-effective
❌ Less precise

🔢 Systematic

Select every kth element

Example: Every 10th customer
✅ Simple to implement
❌ Periodic bias risk
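
The four methods can be sketched with the standard `random` module. Everything here is illustrative: the population of 100 customer IDs, the two strata, and the clusters of 10 are hypothetical choices, not part of the original examples.

```python
import random

rng = random.Random(42)            # fixed seed so the sketch is reproducible
population = list(range(1, 101))   # hypothetical customer IDs 1..100

# Simple random: every member has the same chance of selection
simple = rng.sample(population, 10)

# Stratified: split into strata, then sample within each stratum
strata = {"low": population[:50], "high": population[50:]}  # hypothetical strata
stratified = [x for group in strata.values() for x in rng.sample(group, 5)]

# Cluster: pick whole clusters at random and keep every member of them
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [x for c in rng.sample(clusters, 2) for x in c]

# Systematic: random start, then every k-th element
k = 10
start = rng.randrange(k)
systematic = population[start::k]

print(simple, stratified, cluster_sample, systematic, sep="\n")
```

Note how the stratified sample is guaranteed five IDs from each half, while the simple random sample may by chance draw unevenly from the two halves.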

📊 Sampling Methods Comparison

Figure: Simple Random (random selection) vs. Stratified (by subgroups)

🎯 When to Use Which Method?

🎯 Simple Random: no prior information, homogeneous population
📊 Stratified: important subgroups exist
🏙️ Cluster: geographic or cost constraints
🔢 Systematic: assembly lines, queues

⚠️ Bias in Sampling

🎯 Selection Bias

Sample not representative of population

Example: Phone survey misses people without phones
Solution: Use appropriate sampling frame

🗣️ Response Bias

Answers not truthful or accurate

Example: Social desirability bias in surveys
Solution: Anonymous surveys, neutral questions

📭 Non-response Bias

Participants differ from non-participants

Example: Only the most satisfied or most annoyed customers return a feedback form
Solution: Follow-ups, incentives

📊 Bias vs Random Error

Unbiased, Precise

Centered on target, low spread

Biased, Precise

Consistently off-target

Key Insight: Bias = Systematic error | Random error = Natural variation

🛡️ How to Avoid Bias

🎯 Random Sampling: equal chance for all
📋 Clear Questions: neutral, unambiguous
📊 Adequate Response Rate: follow up with non-respondents
🔍 Pilot Testing: test the survey first

📊 Sampling Error

📖 What is Sampling Error?

Error = x̄ - μ
(sample mean - population mean)
Definition: Difference between sample statistic and population parameter due to random chance.

✅ Sampling Error

  • Due to random chance
  • Reduced by larger samples
  • Quantifiable with SE
  • Always present when sampling
  • Example: Different samples give different x̄

❌ Bias

  • Systematic error
  • Not reduced by larger samples
  • Hard to quantify
  • Can be reduced or avoided through better design
  • Example: Poor sampling method
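
A small simulation makes the contrast concrete. This is a sketch under stated assumptions: the population of 10,000 normally distributed scores and the "biased frame" that can only reach the upper half of the population are both hypothetical.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical population: 10,000 satisfaction scores
population = [rng.gauss(7.0, 1.5) for _ in range(10_000)]
mu = statistics.fmean(population)

# Hypothetical biased frame: only the happier half of the population is reachable
biased_frame = sorted(population)[5_000:]

def mean_abs_error(frame, n, reps=300):
    """Average |x̄ - μ| over repeated samples of size n drawn from frame."""
    errors = [abs(statistics.fmean(rng.sample(frame, n)) - mu) for _ in range(reps)]
    return statistics.fmean(errors)

for n in (25, 400):
    print(n,
          round(mean_abs_error(population, n), 3),     # sampling error only
          round(mean_abs_error(biased_frame, n), 3))   # sampling error plus bias
```

The error from the full population shrinks as n grows, while the error from the biased frame stays near the same offset no matter how large the sample is.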

📐 Standard Error Formulas

Population σ known: SE = σ / √n
Population σ unknown: SE ≈ s / √n
where σ = population SD, s = sample SD, n = sample size
SE measures precision: Smaller SE → More precise estimate

🎯 Example: Customer Satisfaction

Scenario: Survey customer satisfaction (scale 1-10)
μ (true population mean) = 7.2
σ (population SD) = 1.5
n (sample size) = 100
Standard Error: SE = 1.5 / √100 = 1.5 / 10 = 0.15
Interpretation: Sample means will typically vary by about 0.15 from true population mean.
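
The standard error in this example can be checked by simulation. The sketch below assumes the scores are normally distributed, which the example does not state (it only fixes the mean and SD), and it ignores the 1-10 bounds of the scale.

```python
import math
import random
import statistics

mu, sigma, n = 7.2, 1.5, 100
se = sigma / math.sqrt(n)          # 1.5 / 10 = 0.15

# Assumption: scores follow a normal distribution with the stated mean and SD
rng = random.Random(1)
sample_means = [
    statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
    for _ in range(2_000)
]
empirical_se = statistics.stdev(sample_means)   # spread of the simulated means

print(f"theoretical SE = {se:.3f}, simulated SE = {empirical_se:.3f}")
```

The standard deviation of the 2,000 simulated sample means should land close to the theoretical 0.15.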

📉 Reducing Sampling Error

📈 Larger n: SE ∝ 1/√n
🎯 Stratified Sampling: reduces variance
📊 Efficient Design: better sampling methods
🔍 Pilot Studies: estimate variance

📊 Sampling Distribution

📖 What is Sampling Distribution?

Distribution of sample means (x̄'s)
Definition: The probability distribution of a statistic (such as the sample mean) obtained from all possible samples of size n from the population.
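
For a tiny population the phrase "all possible samples" can be taken literally. The sketch below enumerates every ordered sample of size 2, drawn with replacement, from a hypothetical four-value population and compares the resulting sample means with the population parameters.

```python
from itertools import product
import statistics

population = [2, 4, 6, 8]   # hypothetical tiny population
n = 2

# Every possible ordered sample of size n, drawn with replacement
all_samples = list(product(population, repeat=n))
sample_means = [statistics.fmean(s) for s in all_samples]

mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

print(len(all_samples))                                    # number of possible samples
print(statistics.fmean(sample_means), mu)                  # mean of sample means vs μ
print(statistics.pstdev(sample_means), sigma / n ** 0.5)   # SD of sample means vs σ/√n
```

The mean of the 16 sample means equals the population mean, and their standard deviation equals σ/√n, with no simulation noise involved.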

🎯 Visual Example: Multiple Samples

Population: Mean = 50, SD = 10 | Sample size: n = 30
Population distribution: μ = 50, σ = 10
↓ Take multiple samples (n = 30)
Sampling distribution of x̄: μ_x̄ = 50, σ_x̄ = 10/√30 ≈ 1.83 (narrower spread)

📊 Properties of Sampling Distribution of x̄

Mean: μ_x̄ = μ
Mean of sample means equals population mean

Standard Error: σ_x̄ = σ/√n
Spread decreases with larger n

Shape: Normal if the population is normal; approximately normal for large n (rule of thumb: n ≥ 30)
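
The first two properties follow from a short derivation, assuming the observations X_1, ..., X_n are independent draws with mean μ and variance σ² (independence is what allows the variances to add):

```latex
\begin{aligned}
\mathbb{E}[\bar{x}] &= \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[X_i] = \frac{1}{n}\cdot n\mu = \mu \\
\operatorname{Var}(\bar{x}) &= \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i) = \frac{n\sigma^{2}}{n^{2}} = \frac{\sigma^{2}}{n} \\
\sigma_{\bar{x}} &= \sqrt{\operatorname{Var}(\bar{x})} = \frac{\sigma}{\sqrt{n}}
\end{aligned}
```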

📈 Effect of Sample Size on SE

n = 10: SE = σ/√10 ≈ 0.316σ
n = 30: SE = σ/√30 ≈ 0.183σ
n = 100: SE = σ/√100 = 0.1σ
💡 Key Insight: To halve SE, you need to quadruple n (SE ∝ 1/√n).
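
Inverting SE = σ/√n gives the sample size needed for a target precision, n = (σ/SE)². The helper `required_n` below is a hypothetical name for that calculation, and σ = 10 is an illustrative value.

```python
import math

def required_n(sigma, target_se):
    """Smallest n for which sigma / sqrt(n) is at most target_se."""
    return math.ceil((sigma / target_se) ** 2)

sigma = 10.0  # hypothetical population SD
for n in (10, 30, 100, 400):
    print(n, round(sigma / math.sqrt(n), 3))

# Halving the target SE from 1.0 to 0.5 quadruples the required sample size
print(required_n(sigma, 1.0), required_n(sigma, 0.5))
```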

🌟 Central Limit Theorem (CLT)

📖 The Magic of CLT

CLT
Statistical Superpower
Theorem: For independent draws from a population with finite variance, the sampling distribution of the sample mean approaches a normal distribution as n grows, regardless of the population's shape. A common rule of thumb for "large enough" is n ≥ 30.

📊 CLT in Action: Any Population → Normal

Skewed population (right-skewed distribution) → take samples (n ≥ 30) → sampling distribution of x̄ (approximately normal)
Magic: Even from a skewed population, sample means become approximately normally distributed!
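
One way to watch this happen is to track the skewness of the sample means as n grows. The sketch below assumes an exponential population with mean 1 as the right-skewed example; the helper names are hypothetical.

```python
import random
import statistics

rng = random.Random(7)

def skewness(values):
    """Sample skewness: the average cubed z-score."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

def sample_means(n, reps=5_000):
    """Means of `reps` samples of size n from an exponential population (mean 1)."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

for n in (1, 5, 30):
    print(n, round(skewness(sample_means(n)), 2))
```

For an exponential population the skewness of the sample mean is 2/√n in theory, so it falls from about 2 at n = 1 to roughly 0.37 at n = 30: much closer to the symmetric normal shape, but not yet exactly zero.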

✅ CLT Conditions

🎯 Random Sampling: samples must be random
📊 Independence: observations independent (when sampling without replacement, n ≤ 10% of the population)
🔢 Sample Size: n ≥ 30 or population normal

📐 CLT Formulas & Implications

Sample mean distribution: x̄ ~ N(μ, σ/√n), approximately normal with mean μ; the second parameter here is the standard deviation of x̄ (the standard error σ/√n), not the variance
Z-score for sample mean: Z = (x̄ - μ) / (σ/√n), which follows the standard normal distribution; used for confidence intervals & testing

🎯 CLT Example: Website Load Times

Scenario: Website load times are exponentially distributed with mean 2 seconds.
μ (population mean) = 2 seconds
σ (population SD) = 2 seconds
n (sample size) = 50
CLT Application: Even though load times are exponential, the mean of 50 measurements is approximately Normal(μ = 2, SE = 2/√50 ≈ 0.283)
Question: P(average load time > 2.5 seconds)?
Answer: Z = (2.5 - 2)/0.283 ≈ 1.77 → P ≈ 0.038 (3.8%)
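
The calculation can be reproduced with `statistics.NormalDist` and cross-checked against a Monte Carlo run that draws actual exponential load times. This is a sketch; the number of repetitions and the seed are arbitrary choices.

```python
import math
import random
import statistics
from statistics import NormalDist

mu, sigma, n = 2.0, 2.0, 50          # exponential: SD equals the mean
se = sigma / math.sqrt(n)            # about 0.283
z = (2.5 - mu) / se                  # about 1.77
p_normal = 1 - NormalDist().cdf(z)   # CLT approximation of P(x̄ > 2.5)

# Monte Carlo check using real exponential draws (rate = 1 / mean)
rng = random.Random(3)
reps = 10_000
hits = sum(
    statistics.fmean(rng.expovariate(1 / mu) for _ in range(n)) > 2.5
    for _ in range(reps)
)
p_sim = hits / reps

print(f"normal approximation: {p_normal:.4f}, simulation: {p_sim:.4f}")
```

The exact distribution of the mean of 50 exponentials is a gamma distribution that still carries some right skew, so the simulated tail probability can come out a little above the normal approximation.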

🔢 The n ≥ 30 Rule

🤔 Why n = 30 Specifically?

Reason: n = 30 is a convention, not a theorem. By about n = 30, Student's t-distribution is close to the normal distribution: the 95% critical value is t ≈ 2.045 (29 degrees of freedom) versus z = 1.96, a difference of under 5%. This gives a good balance between practicality and accuracy.

📈 Sample Size Effect on Normality

n = 5: Still skewed (not normal)
n = 15: Getting closer (almost normal)
n = 30: Good approximation (CLT works well)
n = 50: Excellent (very close to normal)

⚠️ When n ≥ 30 is NOT Enough

📊 Highly Skewed Data

For extremely skewed distributions (e.g., income), n > 100 may be needed for the CLT to work well.

📉 Heavy Tails

Distributions with extreme outliers (e.g., financial returns) need larger samples.

🎯 Small Population

If sampling more than 10% of the population, apply the finite population correction.
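
The finite population correction multiplies the usual standard error by √[(N - n)/(N - 1)], where N is the population size. A minimal sketch, with a hypothetical function name and illustrative numbers:

```python
import math

def se_with_fpc(sigma, n, population_size):
    """Standard error of the mean with the finite population correction."""
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return sigma / math.sqrt(n) * fpc

# Hypothetical case: sampling 200 of 1,000 units (20% of the population)
print(round(10 / math.sqrt(200), 4))          # uncorrected SE
print(round(se_with_fpc(10, 200, 1_000), 4))  # corrected SE, noticeably smaller
```

When n is a tiny fraction of N the correction factor is close to 1, which is why it is usually ignored below the 10% threshold.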

🎯 Practical Sample Size Guidelines

📊 Normal Population: any n works
🎯 Moderate Skew: n ≥ 30
⚠️ Extreme Skew: n ≥ 100
📈 Binary Data: np ≥ 10 and n(1-p) ≥ 10
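
For binary data the condition depends on the rarer outcome, so low conversion rates need far more than 30 observations. The sketch below uses hypothetical helper names and the threshold of 10 from the guideline above.

```python
import math

def normal_approx_ok(n, p, threshold=10):
    """Success-failure check: both expected counts must reach the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

def min_n_for_normal_approx(p, threshold=10):
    """Smallest n that satisfies the success-failure check for proportion p."""
    return math.ceil(threshold / min(p, 1 - p))

print(normal_approx_ok(30, 0.03))       # 30 users at a 3% rate: not enough
print(min_n_for_normal_approx(0.03))    # users needed at a 3% rate
print(min_n_for_normal_approx(0.5))     # users needed at a 50% rate
```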

🧠 Data Science Applications

📊 A/B Testing

CLT Application: Compare conversion rates between groups
Process:
  • Collect sample data (large enough that np ≥ 10 and n(1-p) ≥ 10 in each group)
  • CLT makes the sample proportions approximately normal
  • Calculate p-value using normal approximation
  • Make decision at a chosen significance level
Without CLT: Would need exact or non-parametric tests
🎯 Confidence Intervals

CLT Application: Estimate population parameters
Formula (95% CI):
x̄ ± 1.96 × (s/√n)
Why it works: The CLT makes x̄ approximately normal, which justifies using Z-scores
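
The formula translates directly into code, and a coverage simulation shows how well it holds up on skewed data. This is a sketch: `mean_ci_95` is a hypothetical helper, and the exponential population with mean 2 and samples of size 50 are illustrative assumptions.

```python
import math
import random
import statistics

def mean_ci_95(sample):
    """Approximate 95% CI for the mean: x̄ ± 1.96 × s/√n."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)   # s uses the n-1 denominator
    return x_bar - 1.96 * se, x_bar + 1.96 * se

# Coverage check on a skewed population (assumed exponential, mean 2)
rng = random.Random(11)
true_mean = 2.0
reps = 2_000
covered = 0
for _ in range(reps):
    sample = [rng.expovariate(1 / true_mean) for _ in range(50)]
    low, high = mean_ci_95(sample)
    covered += low <= true_mean <= high
coverage = covered / reps

print(f"share of intervals containing the true mean: {coverage:.3f}")
```

On skewed data like this, coverage tends to land a little below the nominal 95%, which is one reason larger samples are recommended when skew is strong.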
🏭 Quality Control

Sampling Application: Monitor production processes
Control Charts:
  • Take samples of n items periodically
  • Plot sample means over time
  • CLT: Sample means are approximately normally distributed
  • Set control limits using SE
Example: Detect machine malfunction early
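
A minimal sketch of the control-limit step follows. It assumes the conventional three-standard-error limits around a known target, which the text above does not specify; the target, σ, subgroup size, and sample means are all hypothetical.

```python
import math

def control_limits(target, sigma, n, width=3.0):
    """Lower and upper control limits for sample means of size n.

    Assumption: limits sit `width` standard errors from the target
    (3 is a common convention, not something fixed by the text).
    """
    se = sigma / math.sqrt(n)
    return target - width * se, target + width * se

lcl, ucl = control_limits(target=100.0, sigma=2.0, n=4)

sample_means = [100.4, 99.1, 101.2, 103.6, 100.0]   # hypothetical measurements
flagged = [i for i, m in enumerate(sample_means) if not lcl <= m <= ucl]

print(lcl, ucl)   # limits around the target
print(flagged)    # indices of sample means outside the limits
```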

🏒 Case Study: E-commerce A/B Test

🎯 Problem

Test new checkout button color (red vs blue)

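📊 Sampling

Randomly assign 1000 users to each group (n=1000)

📈 CLT Application

Sample conversion rates are approximately normal by the CLT
SE of each rate = √[p(1-p)/n]
SE of the difference = √[p̄(1-p̄)(1/n₁ + 1/n₂)], where p̄ is the pooled conversion rate

Results:
• Red: 120/1000 = 12% conversion
• Blue: 150/1000 = 15% conversion
• Difference: 3 percentage points (z ≈ 1.96, two-sided p ≈ 0.05) → Borderline significant at the 5% level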
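
These counts can be run through a two-proportion z-test. The sketch below assumes a pooled, two-sided test, since the case study does not say which variant was used; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sided z-test for a difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                          # rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))            # two-sided
    return z, p_value

# Red: 120/1000 conversions, Blue: 150/1000 conversions
z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print(f"z = {z:.3f}, two-sided p = {p_value:.4f}")
```

Under these assumptions the test gives z of about 1.96 and a two-sided p-value of roughly 0.05, so the result sits right at the conventional 5% threshold rather than clearly beyond it.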

🎯 Interview Questions Preview

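Q: Why is n=30 considered the magic number for CLT?
A: It is a rule of thumb. At n=30 the t-distribution's 95% critical value (about 2.045) is within 5% of the normal value (1.96), a balance between practicality and statistical accuracy.
Q: When would you NOT rely on the CLT even with n=30?
A: With extremely skewed distributions or heavy-tailed data, where larger samples are needed, or when sampling >10% of the population without a finite population correction.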

✅ Chapter Summary

🎯 Sampling
Population → Sample → Inference
Bias vs Random Error

📊 Sampling Distribution
Distribution of sample statistics
μ_x̄ = μ, σ_x̄ = σ/√n

🌟 Central Limit Theorem
n ≥ 30 → Approximately normal distribution
x̄ ~ N(μ, σ/√n)

⚑ Key Formulas

SE = σ/√n | μ_x̄ = μ | 95% CI: x̄ ± 1.96×SE | Z = (x̄-μ)/SE | n ≥ 30 rule

💡 Practical Tips

🎯 Always check for bias
📊 Use n ≥ 30 as a starting point for the CLT
⚠️ Watch for extreme skew
📈 SE shrinks in proportion to 1/√n