DSPython

Data Type Conversion

Fix incorrect data types to enable calculations and analysis.

Data Science Beginner 45 min

🔧 Introduction: Sorting Your Tools

Imagine you have a box full of tools: hammers, wrenches, and screwdrivers. If you tried to use a wrench to hammer a nail, it wouldn't work. Each task requires the correct tool type.

In Data Science, **Data Types** are your tools. If Pandas loads the **Price** column as `object` (text), you can't calculate the average. If it loads **Date** as `object`, you can't find the day of the week. **Type Conversion** fixes this issue.

**Golden Rule:** Always ensure your numerical columns are `int` or `float` and your time columns are `datetime64`.
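The Golden Rule can be checked in code. The following is a minimal sketch, not a definitive recipe: the sample data and the `expected_numeric` list are hypothetical, and `is_numeric_dtype` comes from `pandas.api.types`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical sample data: 'Price' was loaded as text, 'Quantity' as numbers
df = pd.DataFrame({
    "Price": ["10.5", "20.0"],
    "Quantity": [3, 5],
})

# Hypothetical list of columns that are supposed to be numeric
expected_numeric = ["Price", "Quantity"]

# Collect every column that breaks the rule
offenders = [col for col in expected_numeric if not is_numeric_dtype(df[col])]
print(offenders)  # ['Price']
```

Any column listed in `offenders` still needs one of the conversions covered in the topics that follow.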

📋 Topic 1: Checking Types: .dtypes and .astype()

The easiest way to check the types of all columns is using the **`.dtypes`** attribute.

🎯 Common Dtypes:

  • **int64:** Whole numbers (Age, Count).
  • **float64:** Decimal numbers (Price, Temperature).
  • **object:** Text or mixed data (Name, City).
  • **datetime64:** Proper time-based data (Timestamp, Date of Birth).

💻 Example: Checking and Converting

import pandas as pd

data = {'A': [1], 'B': ['10']}  # B is text
df = pd.DataFrame(data)

print("Before: \n", df.dtypes)

# Use .astype() to change the type
df['B'] = df['B'].astype(int)

print("\nAfter: \n", df.dtypes)
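`.astype()` also accepts a dictionary that maps column names to target types, which converts several columns in one call. The sketch below uses hypothetical column names and also shows the `ValueError` raised when a value cannot be parsed.

```python
import pandas as pd

# Hypothetical data: every column starts out as text
df = pd.DataFrame({
    "Count": ["1", "2"],
    "Score": ["1.5", "2.5"],
    "Name": ["a", "b"],
})

# Columns not listed in the dict ('Name') are left untouched
converted = df.astype({"Count": "int64", "Score": "float64"})
print(converted.dtypes)

# .astype() raises ValueError when a value is not a valid number
try:
    pd.Series(["10", "abc"]).astype(int)
    failed = False
except ValueError as exc:
    failed = True
    print("Conversion failed:", exc)
```

`.astype()` returns a new object, so the result must be assigned back to keep the change.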

❌ Topic 2: Cleaning Currency and Symbols

If a column has characters like **`$`** or **`,`**, `astype(float)` will fail because it cannot convert those symbols to a number. You must use `str.replace` first.

💻 Example: Currency Conversion

import pandas as pd

data = {'Price': ['$1,200', '$500', '$800']}
df = pd.DataFrame(data)

# 1. Remove '$' sign (regex=False treats '$' as a literal character, not a regex anchor)
df['Price'] = df['Price'].str.replace('$', '', regex=False)

# 2. Remove commas (important for numbers > 999)
df['Price'] = df['Price'].str.replace(',', '', regex=False)

# 3. Convert to float
df['Price'] = df['Price'].astype(float)

print(df.dtypes)
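The two replacement steps can also be collapsed into one. This is a sketch that assumes the only unwanted characters are `$` and `,`: the character class `[$,]` matches either symbol, and `regex=True` is passed explicitly because the default has changed between pandas versions.

```python
import pandas as pd

prices = pd.Series(["$1,200", "$500", "$800.50"])

# Strip every '$' and ',' in a single pass, then convert to float
cleaned = prices.str.replace(r"[$,]", "", regex=True).astype(float)

print(cleaned.tolist())  # [1200.0, 500.0, 800.5]
```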

💡 Advanced Error Handling in Conversion:

If a column holds mostly numbers plus some messy text (`100`, `200`, `N/A`), `astype()` raises a `ValueError` on the first value it cannot parse. Use `pd.to_numeric()` with `errors='coerce'` instead.

import pandas as pd

series = pd.Series(['10', '20', 'N/A', '30'])

# Invalid values become NaN (missing), so the result is float64
cleaned = pd.to_numeric(series, errors='coerce')
print(cleaned)
# Now you can use .fillna() or .dropna() on the NaNs!
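After coercion there are two common follow-ups: fill the gaps or drop them. The sketch below shows both. Filling with the median is only one possible choice, picked here for illustration.

```python
import pandas as pd

series = pd.Series(["10", "20", "N/A", "30"])
cleaned = pd.to_numeric(series, errors="coerce")

# Count how many values could not be parsed
n_invalid = int(cleaned.isna().sum())

# Option 1: replace missing values with the median of the valid ones
filled = cleaned.fillna(cleaned.median())

# Option 2: discard the rows with missing values
dropped = cleaned.dropna()

print(n_invalid)         # 1
print(filled.tolist())   # [10.0, 20.0, 20.0, 30.0]
print(dropped.tolist())  # [10.0, 20.0, 30.0]
```

Counting the coerced values first is a cheap way to notice when a column is messier than expected.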

📅 Topic 3: Working with Dates: pd.to_datetime()

Dates and times are the trickiest data type. Never treat a date as a simple string. The `pd.to_datetime()` function converts messy date formats into a standard `datetime64` object.

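💻 Example: Basic Date Conversion

import pandas as pd

data = {'date_str': ['2024-01-01', '05/15/2023', 'Nov 30, 2022']}
df = pd.DataFrame(data)

# Convert the column to datetime.
# format='mixed' (pandas 2.0+) parses each value on its own; without it,
# pandas 2.x infers one format from the first value and raises on the rest.
df['Date'] = pd.to_datetime(df['date_str'], format='mixed')

print(df['Date'].dtype)  # a datetime64 dtype, e.g. datetime64[ns]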
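When every value in a column follows one known layout, passing an explicit `format` avoids ambiguity (for example, whether `01/02/2024` means January 2 or February 1). The sketch below assumes a hypothetical day-first column and uses `errors='coerce'` to turn unparseable entries into `NaT`, the datetime counterpart of `NaN`.

```python
import pandas as pd

# Hypothetical day-first date strings with one bad entry
raw = pd.Series(["15/05/2023", "01/02/2024", "not a date"])

# %d/%m/%Y means day/month/four-digit year
dates = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

print(dates)
print(dates.isna().tolist())  # [False, False, True]
```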

🌟 Topic 4: Feature Extraction with .dt

Once a column is in `datetime64` format, you can use the special **`.dt` accessor** to easily extract components like the year, month, or day. These extracted components are known as **Time Features** and are vital for time series models.

💻 Example: Extracting Time Features

# Assuming 'df' has the 'Date' column from the previous example

# Create a new integer column for the year (int32 or int64, depending on the pandas version)
df['Year'] = df['Date'].dt.year

# Find the Day of the Week (Monday=0, Sunday=6)
df['Weekday'] = df['Date'].dt.dayofweek

print(df[['Date', 'Year', 'Weekday']])
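The `.dt` accessor offers more than year and weekday. The sketch below builds a few additional features from the same three dates. The column names are illustrative, not a fixed convention.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2024-01-01", "2023-05-15", "2022-11-30"]))

features = pd.DataFrame({
    "Month": dates.dt.month,
    "Quarter": dates.dt.quarter,
    "DayName": dates.dt.day_name(),
    "IsWeekend": dates.dt.dayofweek >= 5,  # Saturday=5, Sunday=6
})

# Subtracting datetimes yields timedeltas; .dt.days converts them to whole days
features["DaysSinceFirst"] = (dates - dates.min()).dt.days

print(features)
```

Date arithmetic like `DaysSinceFirst` only works once the column is a real datetime type, which is why the conversion step comes first.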

📚 Module Summary

  • **Check:** Use `df.dtypes` to find initial types.
  • **Convert:** Use `.astype(int)` or `pd.to_numeric()`.
  • **Clean First:** Remove currency/commas using `.str.replace()` before conversion.
  • **Dates:** Use `pd.to_datetime()` to standardize date strings.
  • **Features:** Use the `.dt` accessor to extract year, month, or day.
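The steps in this summary can be chained into one small cleaning routine. The following is a sketch under stated assumptions: the `raw` table and its column names are hypothetical, and the dates are assumed to share the `YYYY-MM-DD` layout.

```python
import pandas as pd

# Hypothetical raw table where everything was loaded as text
raw = pd.DataFrame({
    "Price": ["$1,200", "$500", "N/A"],
    "Date": ["2024-01-01", "2023-05-15", "2022-11-30"],
})

clean = raw.copy()

# Clean first, then convert; leftover text such as 'N/A' becomes NaN
clean["Price"] = pd.to_numeric(
    clean["Price"].str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)

# Standardize the dates, then extract a feature with .dt
clean["Date"] = pd.to_datetime(clean["Date"], format="%Y-%m-%d")
clean["Year"] = clean["Date"].dt.year

# Check the resulting types
print(clean.dtypes)
```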

🤔 Interview Q&A

**Q: What happens if you call `.astype(float)` on a column that contains symbols such as `$` or `,`?**

It raises a **ValueError**. Pandas cannot convert non-numeric characters (like '$', ',', or 'abc') into integers or floats. You must clean those characters using `.str.replace()` first.

**Q: When should you use `errors='coerce'`?**

Use `errors='coerce'` when a column is mostly numbers but contains a few stray text entries (like 'N/A' or '?'). It makes Pandas convert the valid numbers and turn the invalid entries into **NaN**, which you can handle later with `.fillna()` or `.dropna()`.

**Q: Why is it important to convert date columns to `datetime64`?**

You cannot perform calculations (like finding the difference between two dates) or extract time features (like Day of Week or Quarter) from a plain text string. Converting to `datetime64` enables time series analysis and feature engineering.

**Q: What is the `.dt` accessor?**

The `.dt` accessor is available only on Series with a datetime-like dtype such as `datetime64`. It gives quick access to time components like `.dt.year`, `.dt.month`, `.dt.dayofweek`, or `.dt.quarter`, which can then be used as new features in machine learning models.