SQL Introduction - Data Science with Python

1) What is SQL and why should a Data Scientist care?

SQL is a domain-specific language designed to work with relational databases (PostgreSQL, MySQL, SQL Server, SQLite, etc.). It is standardized by ANSI/ISO, and most engines share common syntax with small dialect differences.

Key capabilities

Data definition (DDL): create and alter tables, constraints, indexes.
Data manipulation (DML): select, insert, update, delete rows.
Data control: permissions & security (not covered in this page).
Transactions: grouped operations that commit or rollback together.

For data science

Pull only the columns and rows you need (reduce memory).
Pre-aggregate at the source (faster analytics downstream).
Join reference tables to enrich features cleanly.
Keep lineage: reproducible, auditable data extracts.

Common SQL families

PostgreSQL (analytics-friendly, standards-first), MySQL/MariaDB (web & OLTP), SQL Server (enterprise), SQLite (embedded/file-based).

In this course page we use SQLite for demos. You can reuse the same queries on other engines with minimal tweaks.

2) Big picture: DDL vs DML (and friends)

DDL — Data Definition Language

Defines schema (tables, columns, constraints, indexes).

CREATE TABLE — make a table
ALTER TABLE — modify structure
DROP TABLE — delete table

DML — Data Manipulation Language

Reads/writes data inside tables.

SELECT — query rows
INSERT — add rows
UPDATE — change rows
DELETE — remove rows

Other categories

TCL (transactions): BEGIN, COMMIT, ROLLBACK
DCL (security): GRANT, REVOKE

Minimal mental model

-- Define your structure first (DDL)
CREATE TABLE Employees (
  EmployeeID INTEGER PRIMARY KEY,
  FirstName  TEXT NOT NULL,
  LastName   TEXT NOT NULL,
  Department TEXT,
  Salary     REAL
);

-- Then put data (DML)
INSERT INTO Employees VALUES (101, 'Ravi', 'Kumar', 'IT', 60000);

-- Then query it (DML)
SELECT FirstName, Department, Salary
FROM Employees
WHERE Salary >= 60000
ORDER BY Salary DESC;

3) The demo database we’ll use on this page

To keep the practice smooth and persistent across requests, your Flask backend is wired to a local SQLite database file (dspython.db). We pre-load two tables: Employees and Projects. You’ll use them in later lessons (basic queries, aggregates, joins, subqueries…).

Employees

EmployeeID — integer primary key
FirstName, LastName — text
Department — text (e.g., IT, HR, Finance)
Salary — real number

Projects

ProjectID — integer primary key
ProjectName — text
Department — text (owner)
Budget — real number

Tip: You can always re-inspect the base tables in later pages using: SELECT * FROM Employees; or SELECT * FROM Projects;

4) Keys & Constraints: building correct tables

Constraints are rules the database enforces to keep your data correct. Good constraints reduce bugs, simplify analysis, and improve query performance by enabling better indexes.

Primary Key (PK)

Uniquely identifies each row. Usually a single integer column or a stable business key.

CREATE TABLE Departments (
  DeptID     INTEGER PRIMARY KEY,  -- PK
  DeptName   TEXT UNIQUE NOT NULL  -- also unique
);

In SQLite, INTEGER PRIMARY KEY auto-rows behave like an alias for the internal rowid.

Foreign Key (FK)

Links child rows to a parent table PK. Enforces referential integrity.

CREATE TABLE Employees (
  EmployeeID  INTEGER PRIMARY KEY,
  FirstName   TEXT NOT NULL,
  LastName    TEXT NOT NULL,
  DeptID      INTEGER,
  Salary      REAL CHECK (Salary >= 0),
  FOREIGN KEY (DeptID) REFERENCES Departments(DeptID)
    ON UPDATE CASCADE
    ON DELETE SET NULL
);

ON UPDATE/DELETE actions (CASCADE, RESTRICT, SET NULL) are crucial for lifecycle rules.

UNIQUE

Ensures no duplicate values across one or more columns.

CREATE TABLE Users (
  UserID    INTEGER PRIMARY KEY,
  Email     TEXT NOT NULL UNIQUE,
  Username  TEXT NOT NULL UNIQUE
);

NOT NULL

Column must always have a value; avoids “unknowns” where they don’t make sense.

CREATE TABLE Products (
  ProductID   INTEGER PRIMARY KEY,
  Name        TEXT NOT NULL,
  Price       REAL NOT NULL CHECK (Price >= 0)
);

CHECK

Custom rule. Great for simple domain constraints.

CREATE TABLE Ratings (
  ReviewID  INTEGER PRIMARY KEY,
  Stars     INTEGER CHECK (Stars BETWEEN 1 AND 5)
);

Indexing tip: Primary keys and unique constraints create indexes automatically. For frequent joins/filters on non-unique columns, add CREATE INDEX to speed reads (careful: indexes slow inserts).

5) Data types, NULLs, and three-valued logic

SQL engines have rich type systems. SQLite is more dynamic (type affinity), but the principles carry across engines. Understanding NULL is essential: it means “unknown/absent,” and it propagates through expressions.

Common types

INTEGER/INT — whole numbers
REAL/DECIMAL — floating or fixed-precision
TEXT/VARCHAR — strings
DATE/TIME/TIMESTAMP — temporal
BLOB — binary

NULL behavior

Comparisons with NULL are not true/false—they’re unknown.

SELECT 1 = NULL;    -- NULL (unknown)
SELECT 1 < NULL;    -- NULL
SELECT NULL IS NULL; -- TRUE
SELECT NULL = NULL; -- NULL (not TRUE)

Guarding against NULL

Use COALESCE to substitute defaults; use IS NULL/IS NOT NULL in predicates.

SELECT FirstName, COALESCE(Department, 'Unknown') AS Dept
FROM Employees
WHERE Department IS NULL OR Salary IS NOT NULL;

Gotcha: Aggregations ignore NULLs (e.g., AVG(NULL, 5) = 5), but COUNT(*) counts all rows whereas COUNT(column) counts only non-NULL values in that column.

6) Normalization: keep data tidy, avoid anomalies

Normal forms help you organize tables to reduce duplication and update anomalies. Don’t memorize every rule— understand the spirit: “one fact in one place.”

1NF — Atomic columns

No repeating groups or arrays; each cell holds a single value.

-- Anti-pattern: comma-separated skills in one column
-- Better: a separate Skills table linked by EmployeeID

2NF — No partial dependencies

If the PK is composite, each non-key attribute depends on the whole key.

-- Orders( OrderID, ProductID, ProductName )  -- ProductName depends only on ProductID
-- Move ProductName to Products, keep Orders(OrderID, ProductID, ...)

3NF — No transitive dependencies

Non-key attributes shouldn’t depend on other non-keys.

-- Employees( EmpID, DeptID, DeptName )
-- DeptName depends on DeptID, not EmpID → split to Departments table

Pragmatic note: It’s fine to denormalize selectively for performance or analytics (star schema), but do it consciously and document your reasoning.

7) How SQL runs your query (logical order & planning)

SQL is declarative—you state what you want, the engine decides how. Knowing the logical order helps you predict results and debug.

Logical evaluation order

FROM (+ JOIN) — build row source
WHERE — filter rows (no aggregates yet)
GROUP BY — make groups
HAVING — filter groups (after aggregates)
SELECT — compute expressions/aliases
ORDER BY — sort final rows
LIMIT/OFFSET — trim results

SELECT Department, AVG(Salary) AS AvgPay
FROM Employees
WHERE Salary > 0
GROUP BY Department
HAVING AVG(Salary) >= 60000
ORDER BY AvgPay DESC
LIMIT 5;

Execution plans: Engines choose indexes, join orders, and algorithms (nested loop, hash join). On Postgres, use EXPLAIN (ANALYZE); in SQLite, EXPLAIN QUERY PLAN ... to reason about performance.

8) Writing readable SQL: style & maintainability

Style checklist

Uppercase keywords (SELECT, FROM, WHERE).
One clause per line; indent nested queries.
Use AS for clear aliases; avoid single-letter aliases unless short-lived.
Prefer JOIN ... ON over legacy comma joins.
Comment tricky logic (-- why, not just -- what).

Before vs After

-- ❌ Hard to read
select e.firstname, e.lastname, p.projectname
from employees e, projects p
where e.department = p.department and e.salary > 60000;

-- ✅ Clear & maintainable
SELECT
  e.FirstName,
  e.LastName,
  p.ProjectName
FROM Employees AS e
JOIN Projects  AS p
  ON e.Department = p.Department
WHERE e.Salary > 60000;

9) Transactions & safety: don’t lose data

Use transactions to group changes. If anything fails, ROLLBACK keeps your database consistent.

BEGIN;  -- or START TRANSACTION

UPDATE Employees
SET Salary = Salary * 1.05
WHERE Department = 'IT';

INSERT INTO Projects (ProjectID, ProjectName, Department, Budget)
VALUES (10, 'Cost Optimization', 'IT', 15000);

-- If everything looks good:
COMMIT;

-- If something went wrong:
-- ROLLBACK;

Isolation levels: Engines offer READ COMMITTED, REPEATABLE READ, SERIALIZABLE, etc. For analytics, default levels usually suffice; for OLTP, choose carefully to avoid race conditions or deadlocks.

10) Hands-on Practice: CREATE TABLE

Let’s practice DDL. Create a brand-new table named Customers with:

CustomerID — INTEGER PRIMARY KEY
CustomerName — TEXT, cannot be NULL
Optional: Email (TEXT UNIQUE)

Once your SQL runs successfully, the system will allow you to mark the lesson as complete.
After successful CREATE, try:

SELECT * FROM Customers;

to view the current records (initially empty).

Write SQL command:

Output will appear here…

SQL Introduction: The Language of Databases