
Feature Engineering

Learn to clean, transform, and create features to get the most out of your machine learning models.

Machine Learning Fundamentals · 75 min

Topic 1: What is Feature Engineering?

Feature Engineering is the process of using domain knowledge to create, transform, or select the most relevant "features" (columns or inputs) for a machine learning model. It is often the most important step in the entire ML pipeline.

A model only sees numbers; it has no understanding of "context." Our job is to provide that context by engineering better features.

Analogy: A Model is a Chef


A machine learning model is like a world-class chef. You can give them messy, un-chopped, raw ingredients (start_time, end_time) and they might make an *okay* meal. But if you first clean, chop, and prepare those ingredients (duration = end_time - start_time, day_of_week), the chef can easily create a masterpiece.

"Garbage In, Garbage Out." If your features are bad, your model will be bad. Better features are more important than a "better" algorithm.

Key Tasks in Feature Engineering:

  • Imputation: Handling missing data.
  • Encoding: Converting categorical data (text) into numbers.
  • Scaling: Changing the range of numerical data.
  • Creation: Creating new features from existing ones.

Topic 2: Handling Missing Data (Imputation)

Most sklearn models cannot run if your data contains `NaN` (Not a Number) or "null" values. We must handle them first.

Strategy 1: Dropping

  • Drop Rows: `df.dropna()`. This is the simplest option, but you lose valuable data. It's only safe if a tiny fraction (< 5%) of your rows have missing data.
  • Drop Columns: `df.drop('column_name', axis=1)`. Use this if a column is mostly empty (e.g., > 60% missing) and probably not useful.
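A minimal sketch of both options, assuming a DataFrame df whose 'cabin' column is mostly empty (the column name is hypothetical):

# Fraction of missing values per column, to decide which strategy applies
missing_fraction = df.isna().mean()

# Drop rows with any missing value (only safe when that fraction is tiny)
df_rows_dropped = df.dropna()

# Drop a column that is mostly empty
df_cols_dropped = df.drop('cabin', axis=1)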

Strategy 2: Imputation

Imputation is the process of "filling in" missing values with a substitute. We use sklearn.impute.SimpleImputer.

  • Numerical Data: Use strategy='mean' or strategy='median'. Median is safer if your data has extreme outliers.
  • Categorical Data: Use strategy='most_frequent' (which fills in the "mode," or most common value).

from sklearn.impute import SimpleImputer

# For numerical data: learn the mean of 'age' and fill its missing values with it
num_imputer = SimpleImputer(strategy='mean')
X_train[['age']] = num_imputer.fit_transform(X_train[['age']])

# For categorical data: fill missing values with the most common category
cat_imputer = SimpleImputer(strategy='most_frequent')
X_train[['embarked']] = cat_imputer.fit_transform(X_train[['embarked']])
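The fitted imputers should later be applied to the test split with transform() only, so the test rows are filled using statistics learned from the training data (see Topic 4). A minimal sketch, assuming an X_test split already exists:

# Reuse the fitted imputers; never call fit() or fit_transform() on X_test
X_test[['age']] = num_imputer.transform(X_test[['age']])
X_test[['embarked']] = cat_imputer.transform(X_test[['embarked']])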

Topic 3: Encoding Categorical Data

Models need numbers, not text. We must convert text-based categories (like "Red", "Green", "Blue") into a numerical format.

Strategy 1: One-Hot Encoding

Used for nominal categories, where there is *no inherent order* (e.g., "Country", "Gender", "Color").

It creates a new binary (0 or 1) column for each category. This prevents the model from learning a false order (e.g., that "Spain" > "France").

import pandas as pd

# The easiest way is with pandas:
df = pd.get_dummies(df, columns=['embarked'], drop_first=True)

# 'embarked' (with 3 categories S, C, Q) becomes:
# 'embarked_Q' (0 or 1)
# 'embarked_S' (0 or 1)
# (We use drop_first=True to avoid the "dummy variable trap")
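One caveat with pd.get_dummies: on a train/test split, the two sets can end up with different columns if a category only appears in one of them. Here is a hedged sketch using sklearn's OneHotEncoder instead, which remembers the categories it saw during fitting (the sparse_output parameter assumes scikit-learn 1.2 or newer):

from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
embarked_train = encoder.fit_transform(X_train[['embarked']])
embarked_test = encoder.transform(X_test[['embarked']])

# Names of the new binary columns, e.g. 'embarked_C', 'embarked_Q', 'embarked_S'
new_columns = encoder.get_feature_names_out(['embarked'])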

Strategy 2: Ordinal Encoding

Used for ordinal categories, where there is a *clear order* (e.g., "Small" < "Medium" < "Large").

It converts each category to a single integer (0, 1, 2, 3...).

# Example:
class_mapping = {'First': 1, 'Second': 2, 'Third': 3}
df['class'] = df['class'].map(class_mapping)
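The same idea works with sklearn's OrdinalEncoder, passing the category order explicitly so it is not inferred alphabetically (a sketch; note the integers start at 0 rather than 1):

from sklearn.preprocessing import OrdinalEncoder

# Explicit order: 'First' -> 0, 'Second' -> 1, 'Third' -> 2
ordinal_encoder = OrdinalEncoder(categories=[['First', 'Second', 'Third']])
df[['class']] = ordinal_encoder.fit_transform(df[['class']])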

Warning: Never use Ordinal Encoding on nominal data. You will trick your model into thinking "Spain" (e.g., 2) is twice as "important" as "France" (e.g., 1).


Topic 4: Feature Scaling & Transformation

Many models (like SVMs, KNNs, Neural Networks, and PCA) are sensitive to the *scale* of your features. If you have `Age` (0-100) and `Salary` (0-1,000,000), the `Salary` feature will dominate the model's calculations.

Note: Tree-based models (Decision Tree, Random Forest, Gradient Boosting) are *not* sensitive to feature scales, so you can skip this step for them.

Strategy 1: Standardization (StandardScaler)

This is the most common scaling method. It transforms each feature to have a **mean of 0** and a **standard deviation of 1**: each value becomes z = (x - mean) / std.

It works best when your data is (mostly) normally distributed (a "bell curve"). It is less distorted by outliers than min-max scaling, but extreme outliers still pull the mean and standard deviation.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train[['age', 'fare']] = scaler.fit_transform(X_train[['age', 'fare']])

Strategy 2: Normalization (MinMaxScaler)

This method rescales every feature to a fixed range, usually **0 to 1**, by computing (x - min) / (max - min) for each value.

It's useful for models that expect data in this range (like neural networks) or when your data is clearly *not* normally distributed. It is highly sensitive to outliers.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train[['age', 'fare']] = scaler.fit_transform(X_train[['age', 'fare']])

CRITICAL: `fit_transform` vs. `transform`

This is a vital concept. When you split your data into `X_train` and `X_test`:

  1. You `fit_transform()` the scaler on `X_train`. This learns the "rules" (the mean, max, etc.) from the training data *and* transforms it.
  2. You ONLY `transform()` the scaler on `X_test`. This applies the *exact same rules* it learned from the training data.

This prevents **Data Leakage**. Your test data must never influence your training process, and that includes the parameters of your scaler.
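
A minimal sketch of the correct pattern, reusing the StandardScaler example from above:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Learn the mean and standard deviation from the TRAINING data, then transform it
X_train[['age', 'fare']] = scaler.fit_transform(X_train[['age', 'fare']])

# Apply those same training-set statistics to the test data; never fit on X_test
X_test[['age', 'fare']] = scaler.transform(X_test[['age', 'fare']])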