Feature Engineering
Learn to clean, transform, and create features to get the most out of your machine learning models.
Topic 1: What is Feature Engineering?
Feature Engineering is the process of using domain knowledge to create, transform, or select the most relevant "features" (columns or inputs) for a machine learning model. It is often the most important step in the entire ML pipeline.
A model only sees numbers; it has no understanding of "context." Our job is to provide that context by engineering better features.
Analogy: A Model is a Chef

A machine learning model is like a world-class chef. You can give them messy, un-chopped, raw ingredients (`start_time`, `end_time`) and they might make an *okay* meal. But if you first clean, chop, and prepare those ingredients (`duration = end_time - start_time`, `day_of_week`), the chef can easily create a masterpiece.
"Garbage In, Garbage Out." If your features are bad, your model will be bad. Better features are more important than a "better" algorithm.
Key Tasks in Feature Engineering:
- Imputation: Handling missing data.
- Encoding: Converting categorical data (text) into numbers.
- Scaling: Changing the range of numerical data.
- Creation: Creating new features from existing ones.
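The last task, creation, is exactly what the chef analogy describes. A minimal sketch, using a hypothetical DataFrame with `start_time` and `end_time` timestamp columns:
import pandas as pd
# Hypothetical trip data with raw timestamps
df = pd.DataFrame({
    'start_time': pd.to_datetime(['2024-01-05 08:00', '2024-01-06 17:30']),
    'end_time':   pd.to_datetime(['2024-01-05 08:45', '2024-01-06 18:10']),
})
# Engineered features: trip duration in minutes and day of the week (0 = Monday)
df['duration_minutes'] = (df['end_time'] - df['start_time']).dt.total_seconds() / 60
df['day_of_week'] = df['start_time'].dt.dayofweek
A single `duration_minutes` column often tells the model far more than two raw timestamps ever could.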
Topic 2: Handling Missing Data (Imputation)
Most sklearn models cannot run if your data contains `NaN` (Not a Number) or "null" values. We must handle them first.
Strategy 1: Dropping
- Drop Rows: `df.dropna()`. This is the simplest option, but you lose valuable data. It's only safe if a tiny fraction (< 5%) of your rows have missing data.
- Drop Columns: `df.drop('column_name', axis=1)`. Use this if a column is mostly empty (e.g., > 60% missing) and probably not useful.
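To choose between these two options, it helps to look at how much of each column is actually missing. A minimal sketch, assuming `df` is your pandas DataFrame and using the rough thresholds above:
# Percentage of missing values per column, highest first
missing_pct = df.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct)
# Applying the rough thresholds above: drop near-empty columns, then the few incomplete rows
df = df.drop(columns=missing_pct[missing_pct > 60].index)
df = df.dropna()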
Strategy 2: Imputation
Imputation is the process of "filling in" missing values with a substitute. We use `sklearn.impute.SimpleImputer`.
- Numerical Data: Use `strategy='mean'` or `strategy='median'`. Median is safer if your data has extreme outliers.
- Categorical Data: Use `strategy='most_frequent'` (which fills in the "mode," or most common value).
from sklearn.impute import SimpleImputer
# For numerical data
num_imputer = SimpleImputer(strategy='mean')
X_train[['age']] = num_imputer.fit_transform(X_train[['age']])
# For categorical data
cat_imputer = SimpleImputer(strategy='most_frequent')
X_train[['embarked']] = cat_imputer.fit_transform(X_train[['embarked']])
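The same fitted imputers should then be applied to the test set with `transform()` only, for the data-leakage reasons explained at the end of this lesson. A minimal sketch, assuming an `X_test` with the same columns:
# Apply the training-set statistics to the test set -- never re-fit here
X_test[['age']] = num_imputer.transform(X_test[['age']])
X_test[['embarked']] = cat_imputer.transform(X_test[['embarked']])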
Topic 3: Encoding Categorical Data
Models need numbers, not text. We must convert text-based categories (like "Red", "Green", "Blue") into a numerical format.
Strategy 1: One-Hot Encoding
Used for nominal categories, where there is *no inherent order* (e.g., "Country", "Gender", "Color").
It creates a new binary (0 or 1) column for each category. This prevents the model from learning a false order (e.g., that "Spain" > "France").
# The easiest way is with pandas:
import pandas as pd
df = pd.get_dummies(df, columns=['embarked'], drop_first=True)
# 'embarked' (with 3 categories S, C, Q) becomes:
# 'embarked_Q' (0 or 1)
# 'embarked_S' (0 or 1)
# (We use drop_first=True to avoid the "dummy variable trap": if 'embarked_Q' and
# 'embarked_S' are both 0, the passenger must be 'C', so a third column is redundant.)
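If you want an encoder you can re-apply to the test set, scikit-learn's `OneHotEncoder` does the same job; a minimal sketch, assuming the usual `X_train`/`X_test` split:
from sklearn.preprocessing import OneHotEncoder
# handle_unknown='ignore' encodes categories never seen in training as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')
embarked_train = encoder.fit_transform(X_train[['embarked']]).toarray()
embarked_test = encoder.transform(X_test[['embarked']]).toarray()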
Strategy 2: Ordinal Encoding
Used for ordinal categories, where there is a *clear order* (e.g., "Small" < "Medium" < "Large").
It converts each category to a single integer (0, 1, 2, 3...).
# Example:
class_mapping = {'First': 1, 'Second': 2, 'Third': 3}
df['class'] = df['class'].map(class_mapping)
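scikit-learn's `OrdinalEncoder` expresses the same idea; you pass the category order explicitly. A minimal sketch, equivalent to the mapping above but starting the integers at 0:
from sklearn.preprocessing import OrdinalEncoder
# Same order as the mapping above, just starting at 0: 'First' -> 0, 'Second' -> 1, 'Third' -> 2
ordinal_encoder = OrdinalEncoder(categories=[['First', 'Second', 'Third']])
df[['class']] = ordinal_encoder.fit_transform(df[['class']])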
Warning: Never use Ordinal Encoding on nominal data. You would trick your model into thinking "Spain" (encoded as 2) is somehow twice "France" (encoded as 1), inventing an order that doesn't exist.
Topic 4: Feature Scaling & Transformation
Many models (like SVMs, KNNs, Neural Networks, and PCA) are sensitive to the *scale* of your features. If you have `Age` (0-100) and `Salary` (0-1,000,000), the `Salary` feature will dominate the model's calculations.
Note: Tree-based models (Decision Tree, Random Forest, Gradient Boosting) are *not* sensitive to feature scales, so you can skip this step for them.
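A quick numerical sketch of the `Age` / `Salary` point above, with made-up values:
import numpy as np
# Two people: ages differ by 30 years, salaries differ by 5,000
person_a = np.array([25, 50_000])   # [age, salary]
person_b = np.array([55, 55_000])
distance = np.linalg.norm(person_a - person_b)
print(distance)       # ~5000.09 -- essentially just the salary gap
print(30 / distance)  # the 30-year age gap contributes less than 1% of the distance
After scaling, both features move within comparable ranges, so the age difference can actually influence a distance-based model.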
Strategy 1: Standardization (StandardScaler)
This is the most common scaling method. It transforms your data to have a **mean of 0** and a **standard deviation of 1**.
It works best when your data is (mostly) normally distributed (a "bell curve"). It is less affected by outliers than min-max scaling, but extreme values still shift the mean and inflate the standard deviation.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train[['age', 'fare']] = scaler.fit_transform(X_train[['age', 'fare']])
Strategy 2: Normalization (MinMaxScaler)
This method scales all data to be within a specific range, usually **0 to 1**.
It's useful for models that expect inputs in this range (like neural networks) or when your data is clearly *not* normally distributed. It is highly sensitive to outliers: a single extreme value stretches the range and squashes everything else.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train[['age', 'fare']] = scaler.fit_transform(X_train[['age', 'fare']])
CRITICAL: `fit_transform` vs. `transform`
This is a vital concept. When you split your data into `X_train` and `X_test`:
- You call `fit_transform()` on `X_train`. This learns the "rules" (the mean, max, etc.) from the training data *and* transforms it.
- You call ONLY `transform()` on `X_test`. This applies the *exact same rules* learned from the training data.
This prevents **Data Leakage**. Your test data must never influence your training process, and that includes the parameters of your scaler.
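A minimal sketch of the pattern, reusing the `StandardScaler` example from above:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Learn the mean and standard deviation from the training data, then transform it
X_train[['age', 'fare']] = scaler.fit_transform(X_train[['age', 'fare']])
# Apply the SAME training-set statistics to the test data -- never re-fit here
X_test[['age', 'fare']] = scaler.transform(X_test[['age', 'fare']])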