Univariate Analysis (One Variable)
Master the art of analyzing a single variable: Distributions, Counts, and Outliers.
Introduction: "Uni" means One
Univariate Analysis is the first step in Exploratory Data Analysis (EDA). It involves analyzing data **one variable at a time**. It does not look at causes or relationships (like "does X affect Y?"); it only describes the data we have.
We divide variables into two types:
1. **Categorical** (Species, City, Yes/No): we look at "Counts".
2. **Numerical** (Age, Price, Height): we look at "Distribution" and "Spread".
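Before analyzing anything, it helps to know which columns fall into which type. The sketch below is one possible way to do that split with `select_dtypes`; the `toy` DataFrame and its values are hypothetical stand-ins for the lesson's `df`, and the column names are borrowed from the penguin examples.

```python
import pandas as pd

# Hypothetical toy data standing in for the lesson's `df` (values are made up).
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream", "Dream"],
    "body_mass_g": [3750.0, 5000.0, 3800.0, 3700.0],
    "flipper_length_mm": [181.0, 217.0, 186.0, 195.0],
})

# Numerical columns: candidates for histograms and boxplots.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical here: candidates for counts.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:  ", numeric_cols)
print("Categorical:", categorical_cols)
```

Note that dtype is only a first guess: a numeric column such as a year or a 0/1 flag may still be better treated as categorical.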
Topic 1: Categorical Data
For text or groups, we simply want to know: **How many items are in each group?**
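Key Tools:
- **value_counts():** The Pandas method that returns the number of rows for each unique value, as a text summary.
- **sns.countplot():** The Seaborn bar chart that shows the same counts visually.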
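Example: Counting Species
import matplotlib.pyplot as plt
import seaborn as sns
print(df['species'].value_counts())  # df is assumed to be the penguins DataFrame loaded earlier
sns.countplot(x='species', data=df)  # the same counts as a bar chart
plt.title("Number of Penguins per Species")
plt.show()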
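Raw counts are often easier to compare as proportions, and missing values are easy to overlook because `value_counts()` drops them by default. The following is a small sketch on a hypothetical `species` Series (the values are invented for illustration) showing both options.

```python
import pandas as pd

# Hypothetical species column with one missing entry.
species = pd.Series(
    ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie", None],
    name="species",
)

# Plain counts: missing values are dropped by default.
counts = species.value_counts()

# Proportions instead of counts (each value's share of the non-missing rows).
shares = species.value_counts(normalize=True)

# Keep missing values as their own group.
with_missing = species.value_counts(dropna=False)

print(counts)
print(shares)
print(with_missing)
```

If the total of `counts` is smaller than the number of rows, the difference is the number of missing values in that column.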
Topic 2: Numerical Distribution (`histplot`)
For numbers, we want to see the **Shape**. Is the data centered? Is it skewed (tilted) to one side?
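The Histogram & KDE:
**Histogram:** Groups numbers into "bins" (equal-width intervals) and counts how many values fall into each one.
**KDE (Kernel Density Estimate):** Draws a smooth curve over the histogram that estimates the probability density, showing the overall shape without the jagged edges of the bins.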
Example: Body Mass Distribution
sns.histplot(x='body_mass_g', data=df, kde=True)  # kde=True overlays the smooth density curve
plt.show()
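To see what "binning" means in numbers rather than bars, here is a minimal sketch using `np.histogram` on a hypothetical list of body masses (the values are invented, not taken from the penguin data). It splits the range into four equal-width bins and counts the values in each.

```python
import numpy as np

# Hypothetical body masses in grams.
masses = np.array([3200, 3400, 3550, 3700, 3750, 3800, 3950, 4200, 4700, 5400])

# Split the range [min, max] into 4 equal-width bins and count values per bin.
counts, edges = np.histogram(masses, bins=4)

# Each bin includes its left edge; only the last bin also includes its right edge.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f} g: {n} values")
```

With this toy data most values land in the two lowest bins and only a few in the upper ones, which is the kind of shape a histogram would draw as a long tail to the right. Changing `bins` changes the picture, so it is worth trying more than one value.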
Topic 3: Outliers & Spread (`boxplot`)
Histograms show shape, but **Boxplots** show **Statistical Range**: the median, the quartiles, and the whiskers. They are a convenient tool for detecting **Outliers** (extreme values).
Example: Checking for Extreme Weights
sns.boxplot(x='body_mass_g', data=df)  # points drawn beyond the whiskers are candidate outliers
plt.show()
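By default, a boxplot's whiskers reach at most 1.5 × IQR beyond the quartiles, and anything further out is drawn as a separate point. The sketch below applies that same rule by hand with pandas, on hypothetical body masses (invented values), so the flagged points can be listed rather than only seen.

```python
import pandas as pd

# Hypothetical body masses in grams; the last value is deliberately extreme.
masses = pd.Series(
    [3200, 3400, 3550, 3700, 3750, 3800, 3950, 4200, 4700, 6300],
    name="body_mass_g",
)

q1 = masses.quantile(0.25)   # first quartile (25th percentile)
q3 = masses.quantile(0.75)   # third quartile (75th percentile)
iqr = q3 - q1                # interquartile range: spread of the middle 50%

# The usual 1.5 * IQR fences.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as candidate outliers.
outliers = masses[(masses < lower) | (masses > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower}, {upper}]")
print(outliers)
```

A flagged value is only a candidate: it may be a data entry error or a genuine extreme observation, so it should be inspected before being removed.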
Module Summary
- Categorical Data: Use `value_counts()` and `sns.countplot()`.
- Numerical Data: Use `describe()` for numbers, `sns.histplot()` for shape.
- Outliers: Use `sns.boxplot()` to spot extreme values.
- KDE: The smooth line that represents probability density.
Interview Q&A
A **Bar Plot** is for Categorical data (gaps between bars). A **Histogram** is for Numerical data (bins touching each other) to show continuous frequency distribution.
**Skewness** measures asymmetry. **Right Skewed** means the tail extends to the right (typically Mean > Median). **Left Skewed** means the tail extends to the left (typically Mean < Median).
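The mean-versus-median relationship can be checked numerically. The following sketch uses two small hypothetical Series (invented values, one the mirror image of the other) and pandas' `skew()` to show how the sign of the skewness lines up with the direction of the tail.

```python
import pandas as pd

# Hypothetical right-skewed data: one large value stretches the tail to the right.
right = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])

# Mirror image of the same data: the tail now points to the left.
left = -right

for name, s in [("right-skewed", right), ("left-skewed", left)]:
    print(
        f"{name}: mean={s.mean():.2f}, "
        f"median={s.median():.2f}, skew={s.skew():.2f}"
    )
```

In the right-skewed sample the single large value pulls the mean above the median and the skewness is positive. The mirrored sample shows the opposite pattern.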
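Boxplots are used to identify **Outliers** and see the **Interquartile Range (IQR)** (the middle 50% of data). The median and IQR they display are robust statistics: extreme values barely move them, and points far outside the box are drawn individually as outliers.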