DSPython

Univariate Analysis (One Variable)

Master the art of analyzing a single variable: Distributions, Counts, and Outliers.

Data Visualization | Beginner | 30 min

🔍 Introduction: "Uni" means One

Univariate Analysis is the first step in Exploratory Data Analysis (EDA). It involves analyzing data **one variable at a time**. It doesn't look at causes or relationships (like "does X affect Y?"); it just describes the data we have.

We divide variables into two types:

  1. **Categorical:** (Species, City, Yes/No) -> We look at "Counts".
  2. **Numerical:** (Age, Price, Height) -> We look at "Distribution" & "Spread".
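As a rough sketch of how this split can look in Pandas, the snippet below sorts the columns of a small DataFrame by dtype. The values are invented for illustration; `select_dtypes` is a standard Pandas method.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "island": ["Dream", "Biscoe", "Dream", "Dream"],
    "body_mass_g": [3750.0, 5000.0, 3800.0, 3700.0],
    "flipper_length_mm": [181, 217, 186, 195],
})

# Numeric columns: candidates for histograms and boxplots.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else (text here): candidates for counts.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Dtype is only a heuristic: a column of numeric codes (for example 0/1 for No/Yes) is still categorical in meaning.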

📊 Topic 1: Categorical Data

For text or groups, we simply want to know: **How many items are in each group?**

✅ Key Tools:

  • **value_counts():** The Pandas method that counts how often each unique value appears (a text summary).
  • **sns.countplot():** The visual bar chart.

💻 Example: Counting Species

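import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("penguins")  # assumption: df is seaborn's sample penguins data
print(df['species'].value_counts())  # rows per species, as text

sns.countplot(x='species', data=df)  # the same counts as a bar chart
plt.title("Number of Penguins per Species")
plt.show()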
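Raw counts can be hard to compare across datasets of different sizes. As a minimal sketch, assuming a small made-up Series of species labels, `value_counts(normalize=True)` returns shares instead of counts:

```python
import pandas as pd

# Hypothetical species labels, invented for this illustration.
species = pd.Series(
    ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    name="species",
)

counts = species.value_counts()                # absolute counts, largest first
shares = species.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares.round(2))
```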

📈 Topic 2: Numerical Distribution (`histplot`)

For numbers, we want to see the **Shape**. Is the data centered? Is it skewed (tilted) to one side?
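Before plotting, a quick numeric check can hint at the shape. The sketch below, using made-up prices, compares the mean and median reported by `describe()` and calls `Series.skew()`. A mean well above the median and a positive skew value both suggest a tail to the right. Treat this as a rule of thumb rather than a guarantee.

```python
import pandas as pd

# Hypothetical prices with a few large values, invented for this illustration.
prices = pd.Series([10, 12, 13, 14, 15, 16, 18, 20, 45, 90])

summary = prices.describe()  # count, mean, std, min, quartiles, max
mean = summary["mean"]
median = summary["50%"]      # the 50th percentile is the median
skewness = prices.skew()     # sample skewness; > 0 suggests a right tail

print(summary)
print(f"mean={mean:.1f}, median={median:.1f}, skew={skewness:.2f}")
```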

💡 The Histogram & KDE:

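**Histogram:** Groups numbers into "bins" (equal-width intervals by default) and counts how many values fall into each.
**KDE (Kernel Density Estimate):** Estimates the underlying probability density as a smooth curve, which can be drawn over the histogram to show the overall "flow" of the data.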
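To make "bins" concrete, here is a small sketch that does the counting by hand with NumPy's `np.histogram`. The body masses and the choice of three bins over 3000 to 6000 g are invented for illustration.

```python
import numpy as np

# Hypothetical body masses in grams, invented for this illustration.
masses = np.array([3200, 3400, 3550, 3700, 3750, 3900, 4200, 4650, 5050, 5400])

# Split the range 3000..6000 into three equal-width bins and count each one.
counts, edges = np.histogram(masses, bins=3, range=(3000, 6000))

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f} g: {n} penguins")
```

A histogram plot draws one bar per bin with these counts as the bar heights.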

[Image of a normal distribution curve]

💻 Example: Body Mass Distribution

sns.histplot(x='body_mass_g', data=df, kde=True)  # kde=True overlays the smooth density curve
plt.show()

📦 Topic 3: Outliers & Spread (`boxplot`)

Histograms show shape, but **Boxplots** show **Statistical Range**: the median, the quartiles, and the whiskers. They are one of the most common tools for detecting **Outliers** (extreme values).

💻 Example: Checking for Extreme Weights

sns.boxplot(x='body_mass_g', data=df)
# Dots beyond the whiskers (by default, more than 1.5 x IQR from the box) are flagged as outliers.
plt.show()
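The rule a boxplot uses by default to flag those dots can be reproduced numerically. This is a minimal sketch of the common 1.5 x IQR rule on made-up body masses; the data and variable names are invented for illustration.

```python
import pandas as pd

# Hypothetical body masses in grams; 9000 is planted as an extreme value.
mass = pd.Series([3400, 3600, 3700, 3800, 3900, 4000, 4100, 4300, 9000])

q1 = mass.quantile(0.25)  # first quartile
q3 = mass.quantile(0.75)  # third quartile
iqr = q3 - q1             # spread of the middle 50% of the data

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones a boxplot would draw as dots.
outliers = mass[(mass < lower_fence) | (mass > upper_fence)]
print(outliers.tolist())
```

Whether a flagged value is an error or a genuine extreme still needs a look at the data itself; the rule only points at candidates.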

📚 Module Summary

  • Categorical Data: Use value_counts() and sns.countplot().
  • Numerical Data: Use describe() for numbers, sns.histplot() for shape.
  • Outliers: Use sns.boxplot() to spot extreme values.
  • KDE: The smooth line that estimates the probability density.

🤔 Interview Q&A


A **Bar Plot** is for Categorical data (gaps between bars). A **Histogram** is for Numerical data (bins touching each other) and shows the frequency distribution of a continuous variable.

**Skewness** measures asymmetry. **Right Skewed** means the tail extends to the right (typically Mean > Median). **Left Skewed** means the tail extends to the left (typically Mean < Median).

Boxplots are used to identify **Outliers** and see the **Interquartile Range (IQR)** (the middle 50% of data). The median and IQR they display are robust summaries: a few extreme values barely change them.
