DSPython

Statistics for Data Science – Probability Distributions

Understand random variable behavior through discrete & continuous distributions: PMF vs PDF, CDF, and their applications in feature distribution and noise modeling.

Statistics Basics Intermediate 50 min

🔥 Random Variable Behavior

🎯 Core Purpose

A probability distribution describes how probability is spread across the possible values of a random variable. Distributions let us model and predict uncertain outcomes in data science.

🎲 Types of Random Variables

Discrete Random Variable

Countable outcomes

Examples:
Dice rolls, Customer counts

Continuous Random Variable

Measurable outcomes

Examples:
Height, Temperature, Time

Key Insight: Probability distributions help us answer questions like: "What's the probability that our model's prediction error exceeds 5%?" or "How likely are we to get 100+ customers today?"

📚 6 Core Concepts

1️⃣

Random Variable

Numerical outcome of a random process

2️⃣

Discrete Distribution

Countable outcomes (dice, counts)

3️⃣

Continuous Distribution

Measurable outcomes (height, time)

4️⃣

PMF vs PDF

Probability functions for each type

5️⃣

CDF

Cumulative Distribution Function

6️⃣

Real-world Meaning

Practical applications in DS

🎯 Learning Focus

No Heavy Derivations
🎯
Focus on Understanding
💡
Conceptual Formulas
🚀
Practical Applications

🎲 Random Variables

X

Definition

A random variable is a function that assigns numerical values to outcomes of a random process.

Notation: X, Y, Z (capital letters)
X: Outcome of dice roll
Y: Customer arrival count
Z: Temperature reading
🎯

Real Examples

E-commerce: X = Daily sales amount
Healthcare: Y = Patient recovery time
Finance: Z = Stock price change
Manufacturing: W = Product defect count

🔄 Random Variable Process

Random Process → Random Variable → Distribution

Coin toss → X = 0 or 1 → P(X=0) = 0.5
Dice roll → X ∈ {1, 2, ..., 6} → P(X=1) = 1/6
Customer arrival → X = arrival count → P(X=k) = ?

Data Science Connection: Every feature in your dataset can be thought of as a random variable with its own probability distribution. Understanding these distributions helps with feature engineering and model selection.
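As a minimal sketch of the process above (random process, then random variable, then distribution), the snippet below simulates a fair six-sided die with Python's standard library and estimates P(X = k) from relative frequencies. The helper name `empirical_pmf`, the seed, and the sample size are illustrative assumptions rather than part of the lesson.

```python
import random
from collections import Counter

def empirical_pmf(samples):
    """Estimate P(X = k) as the relative frequency of each observed value."""
    counts = Counter(samples)
    n = len(samples)
    return {value: count / n for value, count in sorted(counts.items())}

rng = random.Random(42)  # fixed seed so the run is repeatable (arbitrary choice)
rolls = [rng.randint(1, 6) for _ in range(60_000)]  # X = outcome of a fair die roll
pmf_hat = empirical_pmf(rolls)

for face, p in pmf_hat.items():
    print(f"P(X={face}) ~ {p:.3f}  (theory: {1/6:.3f})")
```

The same helper could be pointed at any discrete column of a dataset to get a first look at its distribution.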

📊 Discrete Distributions

🔢

Characteristics

Discrete Distributions describe random variables that take countable values.

Countable: Finite or countably infinite set of values
Gaps: No possible values between adjacent outcomes (for counts, nothing between consecutive integers)
Probability Mass: Each value has its own probability

🎯 Common Discrete Distributions

🎲

Bernoulli

Single trial, two outcomes

Example: Coin toss

📊

Binomial

Number of successes in n independent Bernoulli trials

Example: Success count

📈

Poisson

Number of events in a fixed interval

Example: Customer arrivals

📦

Geometric

Number of trials until the first success

Example: Retry attempts
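The four distributions above each have a short closed-form PMF. The sketch below writes them out with only the `math` module; the function names and the parameter values in the usage lines are illustrative assumptions, and a library such as SciPy would normally be used in practice.

```python
import math

def bernoulli_pmf(k, p):
    """P(X = k) for a single trial with success probability p (k is 0 or 1)."""
    if k == 1:
        return p
    if k == 0:
        return 1 - p
    return 0.0

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(exactly k events) when lam events are expected per interval."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def geometric_pmf(k, p):
    """P(first success happens on trial k), for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p

# Illustrative parameter values (not from the lesson)
print(bernoulli_pmf(1, 0.5))      # fair coin lands heads
print(binomial_pmf(3, 10, 0.5))   # 3 heads in 10 tosses
print(poisson_pmf(2, 4.0))        # 2 arrivals when 4 are expected
print(geometric_pmf(3, 0.2))      # first success on the third retry
```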

📊 PMF (Probability Mass Function)

P(X=1) = 0.1
P(X=2) = 0.3
P(X=3) = 0.4
P(X=4) = 0.2
PMF Property: Σ P(X=x) = 1 (Sum of all probabilities = 1)
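The example PMF above can be stored as a plain dictionary. The sketch below checks the PMF property and sums the mass over an event; `is_valid_pmf` and `prob` are hypothetical helper names chosen for illustration.

```python
import math

# PMF copied from the example above: value -> probability
pmf = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.2}

def is_valid_pmf(pmf, tol=1e-9):
    """Check 0 <= P(X=x) <= 1 for every x and that the probabilities sum to 1."""
    in_range = all(0.0 <= p <= 1.0 for p in pmf.values())
    return in_range and math.isclose(sum(pmf.values()), 1.0, abs_tol=tol)

def prob(event, pmf):
    """P(X in event): add up the mass of every value for which event(x) is true."""
    return sum(p for x, p in pmf.items() if event(x))

print(is_valid_pmf(pmf))
print(prob(lambda x: x >= 3, pmf))      # P(X >= 3) = 0.4 + 0.2
print(prob(lambda x: x % 2 == 0, pmf))  # P(X is even) = 0.3 + 0.2
```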

📈 Continuous Distributions

Characteristics

Continuous Distributions describe random variables that can take any value in an interval.

Uncountable: Uncountably many possible values
No Gaps: Continuous range of values
Probability Density: Area under the curve over an interval = probability

🎯 Common Continuous Distributions

📊

Normal

Bell-shaped, symmetric

Example: Heights

⏱️

Exponential

Time between events

Example: Wait times

Uniform

Equal density across a range

Example: Random numbers

📉

Beta

Distribution over a probability (values between 0 and 1)

Example: A/B testing
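To get a feel for these four distributions, one option is to draw samples with Python's `random` module and compare sample means with the theoretical ones. All parameter values below (mean height 170 cm, rate 0.5, Beta(2, 5), the seed) are illustrative assumptions, not values from the lesson.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed for repeatability (arbitrary)
n = 50_000

# Parameters below are illustrative assumptions.
samples = {
    "normal": [rng.gauss(170, 10) for _ in range(n)],         # heights in cm
    "exponential": [rng.expovariate(0.5) for _ in range(n)],  # rate of 0.5 events per minute
    "uniform": [rng.uniform(0, 1) for _ in range(n)],
    "beta": [rng.betavariate(2, 5) for _ in range(n)],
}

theoretical_mean = {
    "normal": 170,
    "exponential": 1 / 0.5,   # mean of an exponential is 1 / rate
    "uniform": 0.5,           # midpoint of [0, 1]
    "beta": 2 / (2 + 5),      # a / (a + b)
}

sample_mean = {name: statistics.fmean(xs) for name, xs in samples.items()}

for name in samples:
    print(f"{name:12s} sample mean = {sample_mean[name]:.3f}, theory = {theoretical_mean[name]:.3f}")
```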

📊 PDF (Probability Density Function)

[Figure: PDF curve with the area between a and b shaded, equal to P(a ≤ X ≤ b)]
PDF Property: ∫ f(x) dx = 1 (Total area under curve = 1)
Key Difference: P(X = x) = 0 for continuous variables! We work with intervals: P(a ≤ X ≤ b)
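The "area under the curve" idea can be checked numerically. The sketch below integrates a Normal PDF over an interval with the trapezoidal rule and compares the result with the CDF difference from `statistics.NormalDist`. The height model (mean 170 cm, standard deviation 10 cm) and the helper `area_under_pdf` are assumptions made for illustration.

```python
from statistics import NormalDist

def area_under_pdf(pdf, a, b, steps=10_000):
    """Approximate the integral of pdf over [a, b] with the trapezoidal rule."""
    width = (b - a) / steps
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, steps):
        total += pdf(a + i * width)
    return total * width

heights = NormalDist(mu=170, sigma=10)  # hypothetical height model in cm

p_interval = area_under_pdf(heights.pdf, 160, 180)   # area between a=160 and b=180
p_exact = heights.cdf(180) - heights.cdf(160)        # same probability from the CDF

# Shrinking the interval toward a single point drives the probability toward 0.
p_point_like = area_under_pdf(heights.pdf, 170, 170 + 1e-9, steps=10)

print(p_interval, p_exact, p_point_like)
```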

⚖️ PMF vs PDF

PMF

Probability Mass Function

For Discrete Variables

Gives probability for each specific value:

P(X = x)
Properties:
  • 0 ≤ P(X=x) ≤ 1 for each x
  • Σ P(X=x) = 1
  • Direct probability values
PDF

Probability Density Function

For Continuous Variables

Gives density, not direct probability:

P(a ≤ X ≤ b) = ∫ f(x) dx from a to b
Properties:
  • f(x) ≥ 0 for all x
  • ∫ f(x) dx = 1 (integrated over all x)
  • P(X = x) = 0 for any specific x

📋 PMF vs PDF Comparison

| Aspect         | PMF (Discrete)     | PDF (Continuous) |
|----------------|--------------------|------------------|
| Values         | Probabilities      | Densities        |
| Sum/Integral   | Σ P(X=x) = 1       | ∫ f(x) dx = 1    |
| Specific Value | e.g., P(X=5) = 0.3 | P(X=5) = 0       |
| Visualization  | Bar chart          | Smooth curve     |
Remember: PDF gives density, not probability. The probability is the area under the PDF curve between two points.
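One way to see that a density is not a probability: a density can be larger than 1, while the probability of any interval never is. The sketch below uses a narrow Normal distribution (standard deviation 0.1, an assumed value chosen for illustration).

```python
from statistics import NormalDist

narrow = NormalDist(mu=0.0, sigma=0.1)  # illustrative parameters

density_at_zero = narrow.pdf(0.0)                       # a density value, greater than 1 here
prob_near_zero = narrow.cdf(0.05) - narrow.cdf(-0.05)   # P(-0.05 <= X <= 0.05), an area

print(f"f(0) = {density_at_zero:.3f}")
print(f"P(-0.05 <= X <= 0.05) = {prob_near_zero:.3f}")
```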

📊 CDF - Cumulative Distribution Function

CDF

Definition

F(x) = P(X ≤ x): the probability that X is less than or equal to x

Works for Both Types

Discrete: F(x) = Σ P(X = t) for t ≤ x
Continuous: F(x) = ∫ f(t) dt from -∞ to x
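Both definitions can be sketched in a few lines. The discrete part reuses the four-value PMF from the PMF section and sums the mass up to x. The continuous part relies on `statistics.NormalDist.cdf`, which evaluates the integral for a Normal distribution. The helper name `discrete_cdf` is an illustrative assumption.

```python
from itertools import accumulate
from statistics import NormalDist

# Discrete case: PMF values from the PMF section
pmf = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.2}

def discrete_cdf(x, pmf):
    """F(x) = sum of P(X = t) over all t <= x (a step function)."""
    return sum(p for t, p in pmf.items() if t <= x)

values = sorted(pmf)
running_totals = list(accumulate(pmf[v] for v in values))  # CDF height after each jump

# Continuous case: standard Normal, F(x) is the integral of the PDF up to x
standard_normal = NormalDist()

print(discrete_cdf(2.5, pmf))       # flat between 2 and 3, same as F(2)
print(dict(zip(values, running_totals)))
print(standard_normal.cdf(0.0))     # half the area lies to the left of the mean
```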

📈 CDF Properties & Visualization

Discrete CDF (Step Function)

Jumps at each possible value
Flat between values

Continuous CDF (Smooth Curve)

Smooth, continuous curve
Always non-decreasing

🔑 CDF Key Properties

1. Non-decreasing
F(x) never decreases as x increases
2. Limits
F(-∞) = 0, F(∞) = 1
3. Right-continuous
F(x) equals its limit from the right, so at a jump the CDF takes the upper value
4. Probability from CDF
P(a < X ≤ b) = F(b) - F(a)
Data Science Use: CDFs are excellent for comparing distributions, calculating percentiles, and understanding data spread. Many statistical tests and ML algorithms use CDF-based calculations.

🧠 Data Science Applications

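📊

Feature Distribution

Understanding feature distributions helps choose appropriate models and preprocessing.

Example: Roughly normal features usually need little transformation before linear models, while strongly skewed ones often benefit from one (for example, a log transform)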
🎯

Noise Modeling

Modeling measurement errors and random noise in data.

Example: Gaussian noise in sensor readings
🤖

Bayesian Inference

Using prior distributions to update beliefs with new data.

Example: A/B testing with Beta distributions

📈

Anomaly Detection

Identifying outliers using distribution tails.

Example: Z-scores from Normal distribution
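As a minimal sketch combining the noise-modeling and anomaly-detection ideas above, the snippet below simulates sensor readings with Gaussian noise, injects one faulty value, and flags points whose z-score exceeds a threshold. The true value (20.0), noise level (0.5), threshold (3.0), and the helper `zscore_outliers` are all illustrative assumptions.

```python
import random
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing can be flagged
    return [v for v in values if abs(v - mean) / std > threshold]

rng = random.Random(7)  # fixed seed for repeatability (arbitrary)

# Sensor model: true value 20.0 plus Gaussian noise with standard deviation 0.5
readings = [20.0 + rng.gauss(0, 0.5) for _ in range(200)]
readings.append(35.0)  # injected faulty reading

outliers = zscore_outliers(readings)
print(outliers)
```

Note that a large outlier inflates the mean and standard deviation it is measured against, so on small samples a z-score rule may miss it. Median-based rules are a common alternative.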

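🤖 Machine Learning Use Cases

Naive Bayes
Assumes features are conditionally independent given the class, each following a chosen distribution (e.g., Gaussian)
Gaussian Processes
Use multivariate Normal distributions to express uncertainty over predictions
GLMs
Generalized Linear Models pair a link function with a chosen error distribution (e.g., Poisson for counts)
Monte Carlo
Sampling from distributions for simulations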

🏒 Real-World Scenario: E-commerce Click-Through Rates

Variable: X = CTR
Distribution: Beta
Mean: 2.5%
Use: A/B Testing

Analysis: Click-through rate (CTR) is a proportion between 0 and 1, so it is commonly modeled with a Beta distribution. With 1000 impressions and 25 clicks (an observed CTR of 2.5%), we can model the CTR and compare different ad versions using Bayesian inference.
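One possible way to carry out this analysis is a Beta-Binomial update followed by Monte Carlo sampling. The sketch below assumes a uniform Beta(1, 1) prior, which the lesson does not specify, and invents a second ad version with 35 clicks in 1000 impressions purely for comparison. The function names are hypothetical.

```python
import random

def beta_posterior(clicks, impressions, prior_a=1.0, prior_b=1.0):
    """Posterior Beta(a, b) parameters for a CTR, starting from a Beta(prior_a, prior_b) prior."""
    return prior_a + clicks, prior_b + (impressions - clicks)

def prob_b_beats_a(post_a, post_b, draws=50_000, seed=0):
    """Monte Carlo estimate of P(CTR_B > CTR_A) using samples from both posteriors."""
    rng = random.Random(seed)
    wins = sum(rng.betavariate(*post_b) > rng.betavariate(*post_a) for _ in range(draws))
    return wins / draws

post_a = beta_posterior(25, 1000)  # from the scenario: 25 clicks in 1000 impressions
post_b = beta_posterior(35, 1000)  # hypothetical second ad version (made-up data)

mean_a = post_a[0] / (post_a[0] + post_a[1])  # posterior mean a / (a + b)
p_b_better = prob_b_beats_a(post_a, post_b)

print(f"Posterior for A: Beta{post_a}, mean CTR = {mean_a:.4f}")
print(f"Estimated P(B beats A) = {p_b_better:.3f}")
```

With the uniform prior, the posterior mean for version A lands slightly above the observed 2.5%, because the prior adds one pseudo-click and one pseudo-miss.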

Conceptual Formulas

PMF

Probability Mass Function

P(X = x) = p(x)

For discrete random variables
0 ≤ p(x) ≤ 1 for each x
Σ p(x) = 1

PDF

Probability Density Function

P(a ≤ X ≤ b) = ∫ f(x) dx from a to b

For continuous random variables
f(x) ≥ 0 for all x
∫ f(x) dx = 1 (integrated over all x)

CDF

Cumulative Distribution Function

F(x) = P(X ≤ x)

Works for both discrete & continuous
0 ≤ F(x) ≤ 1
F(-∞) = 0, F(∞) = 1

💡 Focus: Understand concepts, not heavy derivations. Know when to use PMF vs PDF, and how CDF helps with probability calculations.

✅ Chapter Summary

🔥

Core Purpose

Model random variable behavior and uncertainty.

📚

6 Key Concepts

Random variables, discrete/continuous distributions, PMF vs PDF, CDF, real-world meaning.

Conceptual Formulas

Focus on understanding PMF, PDF, and CDF relationships.

🧠

Data Science Use

Feature distribution analysis and noise modeling.

📋 Quick Reference Guide

Discrete → PMF
Continuous → PDF
CDF → P(X ≤ x)
PDF Area = Probability
Normal → Bell Curve
Binomial → Counts