Statistics for Data Science: Probability Distributions
Understand random variable behavior through discrete & continuous distributions: PMF vs PDF, CDF, and their applications in feature distribution and noise modeling.
Random Variable Behavior
Core Purpose
A probability distribution describes how probability is spread over the values a random variable can take. Distributions let us model and predict uncertain outcomes in data science.
Types of Random Variables
Discrete Random Variable
Countable outcomes
Examples:
Dice rolls, Customer counts
Continuous Random Variable
Measurable outcomes
Examples:
Height, Temperature, Time
6 Core Concepts
Random Variable
Numerical outcome of a random process
Discrete Distribution
Countable outcomes (dice, counts)
Continuous Distribution
Measurable outcomes (height, time)
PMF vs PDF
Probability functions for each type
CDF
Cumulative Distribution Function
Real-world Meaning
Practical applications in DS
Learning Focus
Random Variables
Definition
A random variable is a function that assigns numerical values to outcomes of a random process.
Real Examples
Y: Customer arrival count
Z: Temperature reading
Random Variable Process
| Random process | Random variable | Example probability |
|---|---|---|
| Coin toss | X = 0 or 1 | P(X=0) = 0.5 |
| Dice roll | X = {1,2,...,6} | P(X=1) = 1/6 |
| Customer arrival | X = arrival count | P(X=k) = ? |
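The coin toss and dice roll examples above can be sketched in code. This is a minimal illustration, assuming a fair coin and a fair six-sided die. The function names are hypothetical, and the customer-arrival case is left out because its probabilities depend on a model that has not been chosen yet.

```python
import random
from fractions import Fraction

def coin_toss_rv(outcome):
    """Random variable X for a coin toss: maps 'T' -> 0 and 'H' -> 1."""
    return {"T": 0, "H": 1}[outcome]

def fair_pmf(values):
    """PMF that gives every listed value the same probability (fairness assumed)."""
    values = list(values)
    return {v: Fraction(1, len(values)) for v in values}

coin_pmf = fair_pmf([0, 1])        # P(X=0) = P(X=1) = 1/2
die_pmf = fair_pmf(range(1, 7))    # P(X=k) = 1/6 for k = 1..6

# Simulate the random process, then apply the random variable to each outcome.
rng = random.Random(0)
tosses = [coin_toss_rv(rng.choice("HT")) for _ in range(10_000)]
heads_share = sum(tosses) / len(tosses)   # should land close to 0.5
```

The random variable is the mapping from outcome to number. The distribution is the separate object that attaches probabilities to those numbers.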
Discrete Distributions
Characteristics
Discrete Distributions describe random variables that take countable values.
Finite or countably infinite values
No values between adjacent outcomes (for example, a die never shows 2.5)
Individual probabilities for each value
Common Discrete Distributions
Bernoulli
Single trial, two outcomes
Binomial
n independent Bernoulli trials
Poisson
Events in fixed interval
Geometric
Trials until first success
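The four distributions listed above each have a short closed-form PMF. The sketch below writes them with the standard library only. The parameter names (`p`, `n`, `lam`) and the example values are assumptions chosen for illustration.

```python
import math

def bernoulli_pmf(k, p):
    """P(X = k) for one trial with success probability p; k is 0 or 1."""
    return p if k == 1 else 1 - p

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent Bernoulli(p) trials)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(k events in a fixed interval) when events occur at average rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def geometric_pmf(k, p):
    """P(first success occurs on trial k), for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p

# Each PMF sums to 1 over its support (truncated where the support is infinite).
binomial_total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
poisson_total = sum(poisson_pmf(k, 4.0) for k in range(100))
geometric_total = sum(geometric_pmf(k, 0.2) for k in range(1, 500))
```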
PMF (Probability Mass Function)
Continuous Distributions
Characteristics
Continuous Distributions describe random variables that can take any value in an interval.
Uncountably many possible values
Continuous range of values
Area under curve = probability
Common Continuous Distributions
Normal
Bell-shaped, symmetric
Exponential
Time between events
Uniform
Equal density across the range
Beta
Distribution over probabilities (values between 0 and 1)
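One way to get a feel for these four distributions is to draw samples and compare the sample mean with the theoretical mean. The sketch below uses Python's `random` module. All parameter values (mean 170, rate 0.5, and so on) are arbitrary assumptions, not values from the text.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the sketch is repeatable
n = 50_000

samples = {
    "normal": [rng.gauss(170, 10) for _ in range(n)],         # mean 170, std 10
    "exponential": [rng.expovariate(0.5) for _ in range(n)],  # rate 0.5, mean 2
    "uniform": [rng.uniform(0, 10) for _ in range(n)],        # range [0, 10], mean 5
    "beta": [rng.betavariate(2, 8) for _ in range(n)],        # alpha 2, beta 8, mean 0.2
}

sample_means = {name: statistics.fmean(xs) for name, xs in samples.items()}
```

With this many draws, each sample mean should sit close to the theoretical mean noted in the comments.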
PDF (Probability Density Function)
Key Difference: P(X = x) = 0 for continuous variables! We work with intervals: P(a ≤ X ≤ b)
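To make the interval idea concrete, here is a small sketch using `statistics.NormalDist` from the standard library. The height model (mean 170 cm, standard deviation 10 cm) is a hypothetical example, not data from the text.

```python
from statistics import NormalDist

# Hypothetical example: adult height in cm, modeled as Normal(mean=170, std=10).
height = NormalDist(mu=170, sigma=10)

# Interval probability = area under the PDF, obtained here as a CDF difference.
p_160_to_180 = height.cdf(180) - height.cdf(160)   # about 0.68

# Shrinking the interval around 170 drives the probability toward 0,
# which is why P(X = 170) itself is 0.
widths = [1.0, 0.1, 0.001]
p_near_170 = [height.cdf(170 + w / 2) - height.cdf(170 - w / 2) for w in widths]
```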
PMF vs PDF
Probability Mass Function
For Discrete Variables
Gives probability for each specific value:
- 0 ≤ P(X=x) ≤ 1 for each x
- Σ P(X=x) = 1
- Direct probability values
Probability Density Function
For Continuous Variables
Gives density, not direct probability:
- f(x) ≥ 0 for all x
- ∫ f(x) dx = 1
- P(X = x) = 0 for any specific x
PMF vs PDF Comparison
| Aspect | PMF (Discrete) | PDF (Continuous) |
|---|---|---|
| Values | Probabilities | Densities |
| Sum/Integral | Σ P(X=x) = 1 | ∫ f(x) dx = 1 |
| Specific Value | P(X=5) = 0.3 | P(X=5) = 0 |
| Visualization | Bar chart | Smooth curve |
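The "Values" and "Sum/Integral" rows can be checked numerically. The sketch below sums a binomial PMF and integrates an exponential PDF with a simple trapezoidal rule. The specific parameters and the helper `trapezoid` are assumptions made for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def exponential_pdf(x, rate):
    """Density of an exponential distribution with the given rate."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def trapezoid(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    h = (b - a) / steps
    inner = sum(f(a + i * h) for i in range(1, steps))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))

# PMF: add up the probabilities of every possible value.
pmf_total = sum(binomial_pmf(k, 20, 0.25) for k in range(21))

# PDF: integrate the density over [0, 40], where the remaining tail is negligible.
pdf_total = trapezoid(lambda x: exponential_pdf(x, 1.5), 0.0, 40.0)

# A single PDF value is a density, not a probability, so it may exceed 1.
density_at_zero = exponential_pdf(0.0, 1.5)   # 1.5
```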
CDF - Cumulative Distribution Function
Definition
F(x) = P(X ≤ x): the probability that X is less than or equal to x
Works for Both Types
CDF Properties & Visualization
Discrete CDF (Step Function)
Jumps at each possible value
Flat between values
Continuous CDF (Smooth Curve)
Smooth, continuous curve
Always non-decreasing
CDF Key Properties
F(x) never decreases as x increases
F(-∞) = 0, F(∞) = 1
No sudden drops
P(a < X β€ b) = F(b) - F(a)
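These properties can be seen in a short sketch. The discrete case builds the CDF of a fair die as a running sum of its PMF. The continuous case uses the standard normal from `statistics.NormalDist`. The fair die and the helper name `die_cdf` are assumptions for illustration.

```python
from fractions import Fraction
from itertools import accumulate
from statistics import NormalDist

# Discrete case: the CDF of a fair die is the running sum of its PMF.
values = list(range(1, 7))
pmf = [Fraction(1, 6)] * 6
cdf = dict(zip(values, accumulate(pmf)))   # F(1) = 1/6, ..., F(6) = 1

def die_cdf(x):
    """Step function F(x) = P(X <= x), flat between the possible values."""
    if x < 1:
        return Fraction(0)
    return cdf[min(int(x), 6)]

# P(a < X <= b) = F(b) - F(a): probability of rolling 3, 4, or 5.
p_3_to_5 = die_cdf(5) - die_cdf(2)         # 1/2

# Continuous case: the same identity with a standard normal CDF.
z = NormalDist()                           # mean 0, std 1
p_within_one_sd = z.cdf(1) - z.cdf(-1)     # about 0.6827
```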
Data Science Applications
Feature Distribution
Understanding feature distributions helps choose appropriate models and preprocessing.
Noise Modeling
Modeling measurement errors and random noise in data.
Bayesian Inference
Using prior distributions to update beliefs with new data.
Anomaly Detection
Identifying outliers using distribution tails.
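As a rough sketch of tail-based anomaly detection, the code below treats measurement noise as normally distributed and flags readings that fall far into either tail. The sensor readings, the choice of baseline, and the 0.001 threshold are all hypothetical assumptions.

```python
from statistics import NormalDist, fmean, stdev

# Hypothetical sensor readings; the last value is an injected outlier.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7, 20.0, 26.5]

# Fit the noise model on readings assumed to be normal (all but the last).
baseline = readings[:-1]
model = NormalDist(mu=fmean(baseline), sigma=stdev(baseline))

def two_sided_tail(x, dist):
    """P(|X - mean| >= |x - mean|) under dist; small values mean x is deep in a tail."""
    z = abs(x - dist.mean) / dist.stdev
    return 2 * (1 - NormalDist().cdf(z))

threshold = 0.001   # assumed cutoff; tune for the application
anomalies = [x for x in readings if two_sided_tail(x, model) < threshold]
```

In practice the baseline would come from data known to be clean, and a heavier-tailed distribution may fit real noise better than a Normal.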
Machine Learning Use Cases
Naive Bayes assumes feature independence with specific distributions
Normal distributions are used to model uncertainty
Generalized Linear Models use specific error distributions
Monte Carlo methods sample from distributions for simulations
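Sampling from a distribution to approximate a quantity, often called Monte Carlo simulation, can be sketched as follows. The example (two fair dice totaling at least 10) is an assumption chosen because the exact answer is easy to verify.

```python
import random

rng = random.Random(7)   # fixed seed so the sketch is repeatable
trials = 100_000

# Draw two fair dice per trial and count how often the total is at least 10.
hits = sum(1 for _ in range(trials) if rng.randint(1, 6) + rng.randint(1, 6) >= 10)
estimate = hits / trials

exact = 6 / 36   # six of the 36 equally likely pairs total 10 or more
```

The estimate approaches the exact value as the number of trials grows, which is what makes simulation useful when no closed form is available.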
Real-World Scenario: E-commerce Click-Through Rates
Analysis: A click-through rate (CTR) is a probability between 0 and 1, so uncertainty about it is commonly modeled with a Beta distribution. With 1000 impressions and 25 clicks (an observed CTR of 2.5%), we can model the CTR and compare different ad versions using Bayesian inference.
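A minimal sketch of this analysis follows, assuming a uniform Beta(1, 1) prior, which the text does not specify. The first ad uses the 1000 impressions and 25 clicks from the scenario. The second ad's numbers are invented purely to show the comparison step.

```python
import random

impressions, clicks = 1000, 25

# Assumed prior: Beta(1, 1), i.e. uniform over [0, 1].
prior_a, prior_b = 1, 1

# Conjugate update: clicks add to alpha, non-clicks add to beta.
post_a = prior_a + clicks                      # 26
post_b = prior_b + (impressions - clicks)      # 976
posterior_mean = post_a / (post_a + post_b)    # 26 / 1002, about 0.026

# Hypothetical second ad version (numbers invented for illustration only).
b_impressions, b_clicks = 1000, 40
b_post_a = prior_a + b_clicks
b_post_b = prior_b + (b_impressions - b_clicks)

# Monte Carlo estimate of P(CTR_B > CTR_A) from posterior draws.
rng = random.Random(0)
draws = 20_000
wins = sum(
    rng.betavariate(b_post_a, b_post_b) > rng.betavariate(post_a, post_b)
    for _ in range(draws)
)
prob_b_better = wins / draws
```

The Beta prior is convenient here because it is conjugate to click/no-click data, so the posterior is again a Beta distribution and needs no numerical fitting.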
Conceptual Formulas
Probability Mass Function
For discrete random variables
0 ≤ p(x) ≤ 1 for each x
Σ p(x) = 1
Probability Density Function
For continuous random variables
f(x) ≥ 0 for all x
∫ f(x) dx = 1
Cumulative Distribution Function
Works for both discrete & continuous
0 ≤ F(x) ≤ 1
F(-∞) = 0, F(∞) = 1
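The three functions are linked, and the relationships can be written compactly using the same symbols p, f, and F. Here F(x⁻) denotes the value of F just to the left of x, and the derivative form holds wherever F is differentiable.

```latex
\begin{align*}
\text{Discrete:} \quad & F(x) = \sum_{t \le x} p(t), \qquad p(x) = F(x) - F(x^-) \\
\text{Continuous:} \quad & F(x) = \int_{-\infty}^{x} f(t)\,dt, \qquad f(x) = \frac{d}{dx} F(x) \\
\text{Both:} \quad & P(a < X \le b) = F(b) - F(a)
\end{align*}
```

In words: the CDF accumulates the PMF or PDF, and the PMF or PDF can be recovered from the jumps or the slope of the CDF.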
Focus: Understand concepts, not heavy derivations. Know when to use PMF vs PDF, and how CDF helps with probability calculations.
Chapter Summary
Core Purpose
Model random variable behavior and uncertainty.
6 Key Concepts
Random variables, discrete/continuous distributions, PMF vs PDF, CDF, real-world meaning.
Conceptual Formulas
Focus on understanding PMF, PDF, and CDF relationships.
Data Science Use
Feature distribution analysis and noise modeling.