Sampling & Central Limit Theorem
Master the bridge between samples and populations - from sampling methods to the powerful CLT that underpins all inferential statistics.
Population vs Sample
Population
Complete set of all items/individuals of interest.
- Parameter: true value (μ, σ)
- Often impractical to measure
- Example: all voters in a country
Sample
Subset selected from population.
- Statistic: estimated value (x̄, s)
- Practical to measure
- Example: 1000 voters surveyed
Population → Sample → Inference
Population
All elements
Parameters: μ, σ
Sample
Selected subset
Statistics: x̄, s
Inference
Draw conclusions about the population
Real-World Example: E-commerce
Sampling Methods
Simple Random
Every member has an equal chance of selection.
Drawback: may miss small subgroups.
Stratified
Divide the population into strata, then sample from each stratum.
Drawback: requires prior knowledge of the subgroups.
Cluster
Randomly select clusters, then sample everyone within each chosen cluster.
Drawback: less precise than simple random sampling.
Systematic
Select every kth element from an ordered list.
Drawback: risk of periodic bias.
Sampling Methods Comparison
[Figure: visual comparison of the sampling methods, e.g. simple random vs. stratified selection]
When to Use Which Method?
- Simple random: no prior information, homogeneous population
- Stratified: important subgroups exist
- Cluster: geographic or cost constraints
- Systematic: assembly lines, queues
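A minimal Python sketch of the four methods above; the customers DataFrame, the region column, and all sample sizes are made-up placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical population: 10,000 customers with a 'region' subgroup column
customers = pd.DataFrame({
    "id": np.arange(10_000),
    "region": rng.choice(["north", "south", "east", "west"], size=10_000),
})

# 1. Simple random: every member has an equal chance of selection
simple = customers.sample(n=500, random_state=42)

# 2. Stratified: sample the same fraction within each region
stratified = customers.groupby("region", group_keys=False).sample(frac=0.05, random_state=42)

# 3. Cluster: randomly pick whole regions and keep everyone in them
chosen_regions = rng.choice(customers["region"].unique(), size=2, replace=False)
cluster = customers[customers["region"].isin(chosen_regions)]

# 4. Systematic: every k-th customer after a random start
k = 20
start = rng.integers(0, k)
systematic = customers.iloc[start::k]
```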
Bias in Sampling
Selection Bias
Sample not representative of the population
Response Bias
Answers not truthful or accurate
Non-response Bias
Participants differ from non-participants
Bias vs Random Error
- Unbiased, precise: centered on the target, low spread
- Biased, precise: consistently off-target, even with low spread
How to Avoid Bias
- Random selection: give every member an equal chance
- Neutral wording: keep questions neutral and unambiguous
- Follow-up: contact non-respondents again
- Pilot test: test the survey first
Sampling Error
What Is Sampling Error?
Sampling Error
- Due to random chance
- Reduced by larger samples
- Quantifiable with the standard error (SE)
- Always present
- Example: different samples give different x̄
Bias
- Systematic error
- Not reduced by larger samples
- Hard to quantify
- Can be eliminated by better design
- Example: a poor sampling method
Standard Error Formulas
- Population SD known: SE = σ/√n, where σ = population SD and n = sample size
- Population SD unknown: SE ≈ s/√n, where s = sample SD and n = sample size
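A short sketch of both formulas in code; σ = 15 and n = 100 are illustrative values, and the sample itself is simulated.

```python
import numpy as np

# Population SD known: SE = sigma / sqrt(n)
sigma, n = 15.0, 100
se_known = sigma / np.sqrt(n)            # 1.5

# Population SD unknown: estimate it with the sample SD, SE ≈ s / sqrt(n)
rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=sigma, size=n)
s = sample.std(ddof=1)                   # sample SD (n - 1 denominator)
se_estimated = s / np.sqrt(n)

print(f"SE with known sigma:    {se_known:.3f}")
print(f"SE estimated from data: {se_estimated:.3f}")
```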
Example: Customer Satisfaction
Reducing Sampling Error
- Increase the sample size: SE ∝ 1/√n
- Use stratification: reduces variance
- Choose better sampling methods
- Estimate the variance (e.g., with a pilot study) to plan the sample size
Sampling Distribution
What Is a Sampling Distribution?
The distribution of a statistic (such as x̄) across all possible samples of size n from the population.
Visual Example: Multiple Samples
Properties of the Sampling Distribution of x̄
Mean
The mean of the sample means equals the population mean: μ_x̄ = μ
Standard Error
SE = σ/√n, so the spread decreases with larger n
Distribution
Approximately normal if the population is normal or n ≥ 30
Effect of Sample Size on SE
- n = 10: SE = σ/√10 ≈ 0.316σ
- n = 30: SE = σ/√30 ≈ 0.183σ
- n = 100: SE = σ/√100 = 0.1σ
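A simulation sketch that checks the properties and the values above; μ = 50 and σ = 10 are assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 50, 10                       # assumed population parameters

for n in (10, 30, 100):
    # Draw 5,000 samples of size n and keep each sample mean
    means = rng.normal(mu, sigma, size=(5_000, n)).mean(axis=1)
    print(f"n={n:>3}: mean of sample means = {means.mean():6.2f}, "
          f"SD of sample means = {means.std(ddof=1):.3f}, "
          f"theory sigma/sqrt(n) = {sigma / np.sqrt(n):.3f}")
# The SD of the sample means comes out near 0.316*sigma, 0.183*sigma and 0.1*sigma
# for n = 10, 30 and 100, matching the values above.
```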
Central Limit Theorem (CLT)
The Magic of CLT
Whatever the shape of the population, the distribution of the sample mean x̄ approaches a normal distribution as n grows.
CLT in Action: Any Population → Normal
[Figure: a skewed population, sampled with n ≥ 30, produces an approximately normal sampling distribution of x̄]
CLT Conditions
Random Sampling
Samples must be drawn at random
Independence
Observations are independent (n ≤ 10% of the population when sampling without replacement)
Sample Size
n ≥ 30, or the population itself is normal
CLT Formulas & Implications
- x̄ is approximately normal with mean μ
- Standard error: σ/√n
- Z = (x̄ − μ)/(σ/√n) follows the standard normal distribution
- This is the basis for confidence intervals and hypothesis testing
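A sketch of the standardization step and a CLT-based 95% confidence interval; μ, σ, n, and the observed sample mean are assumed values for illustration.

```python
import numpy as np
from scipy import stats

mu, sigma, n = 100, 20, 64           # assumed population mean, SD, and sample size
se = sigma / np.sqrt(n)              # standard error = sigma / sqrt(n) = 2.5

xbar = 104                           # an assumed observed sample mean
z = (xbar - mu) / se                 # Z = (xbar - mu) / (sigma / sqrt(n))
p_upper = 1 - stats.norm.cdf(z)      # P(sample mean > 104) if the true mean is 100

# CLT-based 95% confidence interval around the observed sample mean
ci_low, ci_high = xbar - 1.96 * se, xbar + 1.96 * se
print(f"z = {z:.2f}, upper-tail p = {p_upper:.3f}, 95% CI = ({ci_low:.1f}, {ci_high:.1f})")
```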
CLT Example: Website Load Times
What is the probability that the sample mean load time exceeds 2.5 s, given μ = 2 s and SE = 0.283 s?
Answer: Z = (2.5 − 2)/0.283 ≈ 1.77 → P ≈ 0.038 (3.8%)
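The same calculation in code, plugging in the numbers from the answer (μ = 2 s, SE = 0.283 s, threshold 2.5 s).

```python
from scipy import stats

mu, se, threshold = 2.0, 0.283, 2.5
z = (threshold - mu) / se              # ≈ 1.77
p = 1 - stats.norm.cdf(z)              # ≈ 0.039; rounding Z to 1.77 first gives ≈ 0.038 as above
print(f"Z = {z:.2f}, P = {p:.3f}")
```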
The n ≥ 30 Rule
Why n = 30 Specifically?
It is an empirical rule of thumb: for most moderately skewed populations, the sampling distribution of x̄ is close enough to normal once n reaches about 30.
Sample Size Effect on Normality
[Figure: sampling distributions of x̄ for increasing n, from clearly non-normal at small n to essentially normal once n ≥ 30 and beyond]
When n ≥ 30 is NOT Enough
Highly Skewed Data
For extremely skewed distributions (e.g., income), n > 100 may be needed for the CLT to work well.
Heavy Tails
Distributions with extreme outliers (e.g., financial returns) need larger samples.
Small Population
If you sample more than 10% of the population, apply the finite population correction.
Practical Sample Size Guidelines
- Population already normal: any n works
- General case: n ≥ 30
- Highly skewed data: n ≥ 100
- Proportions: np ≥ 10 and n(1 − p) ≥ 10
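A simulation sketch of the highly-skewed caveat above: it measures the skewness of sample means drawn from a lognormal population (a stand-in for income-like data; all parameters assumed). If the CLT had fully kicked in, the skewness would be near zero.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(7)

for n in (30, 100, 500):
    # 5,000 sample means from a heavily right-skewed lognormal population
    means = rng.lognormal(mean=0.0, sigma=1.5, size=(5_000, n)).mean(axis=1)
    print(f"n={n:>3}: skewness of the sample means = {skew(means):.2f}")
# The skewness is still clearly above 0 at n = 30, so the normal approximation
# is poor there; it only drifts toward 0 as n grows much larger.
```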
Data Science Applications
A/B Testing
- Collect sample data (n ≥ 30 per group)
- The CLT ensures the sample means are approximately normal
- Calculate the p-value using the normal approximation
- Make a decision at a stated confidence level
Confidence Intervals
- Use the CLT to build intervals of the form x̄ ± z·SE (z ≈ 1.96 for 95%)
Quality Control
- Take samples of n items periodically
- Plot the sample means over time
- CLT: the sample means are approximately normally distributed
- Set control limits using the standard error (e.g., x̄ ± 3·SE)
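A minimal control-chart sketch following the steps above; the in-control process mean, SD, and subgroup size are assumed, and the 3·SE limits are one common convention.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n = 250.0, 4.0, 5            # assumed in-control process mean, SD, subgroup size

se = sigma / np.sqrt(n)                 # CLT: subgroup means are approximately Normal(mu, se)
ucl, lcl = mu + 3 * se, mu - 3 * se     # upper / lower control limits on the mean

# Simulate 20 periodic subgroups and flag any out-of-control means
for t in range(20):
    xbar = rng.normal(mu, sigma, size=n).mean()
    flag = "OUT OF CONTROL" if not (lcl <= xbar <= ucl) else ""
    print(f"t={t:02d}  mean={xbar:7.2f}  {flag}")
```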
Case Study: E-commerce A/B Test
Problem
Test a new checkout button color (red vs. blue)
Sampling
Randomly assign 1,000 users to each group (n = 1,000 per group)
CLT Application
Conversion rates are approximately normal by the CLT
SE = √[p(1 − p)/n]
- Red: 120/1000 = 12% conversion
- Blue: 150/1000 = 15% conversion
- Difference: 3 percentage points (p-value ≈ 0.02) → statistically significant
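A two-proportion z-test sketch using the case-study numbers; whether a one- or two-sided p-value is appropriate depends on the hypothesis, so both are printed.

```python
import numpy as np
from scipy import stats

# Case-study data: red button vs. blue button
x_red, n_red = 120, 1000       # 12% conversion
x_blue, n_blue = 150, 1000     # 15% conversion

p_red, p_blue = x_red / n_red, x_blue / n_blue
p_pool = (x_red + x_blue) / (n_red + n_blue)                  # pooled conversion rate
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_red + 1 / n_blue))

z = (p_blue - p_red) / se
p_one_sided = 1 - stats.norm.cdf(z)
p_two_sided = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"z = {z:.2f}, one-sided p = {p_one_sided:.3f}, two-sided p = {p_two_sided:.3f}")
# z ≈ 1.96; the one-sided p ≈ 0.025 is in line with the ≈ 0.02 quoted above,
# and the difference is significant at the 5% level.
```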
Interview Questions Preview
Chapter Summary
Sampling
Population → Sample → Inference
Sampling Distribution
The distribution of a sample statistic across repeated samples
Central Limit Theorem
n ≥ 30 → the sampling distribution of x̄ is approximately normal