SQL Introduction: The Language of Databases
SQL (Structured Query Language) lets you define data structures, query data, filter records, aggregate values, and combine tables. In data science, SQL is your gateway to clean, accurate, and well-scoped datasets before you jump into Python, Pandas, or ML.
1) What is SQL and why should a Data Scientist care?
SQL is a domain-specific language designed to work with relational databases (PostgreSQL, MySQL, SQL Server, SQLite, etc.). It is standardized by ANSI/ISO, and most engines share common syntax with small dialect differences.
Key capabilities
- Data definition (DDL): create and alter tables, constraints, indexes.
- Data manipulation (DML): select, insert, update, delete rows.
- Data control: permissions & security (not covered in this page).
- Transactions: grouped operations that commit or rollback together.
For data science
- Pull only the columns and rows you need (reduce memory).
- Pre-aggregate at the source (faster analytics downstream).
- Join reference tables to enrich features cleanly.
- Keep lineage: reproducible, auditable data extracts.
Common SQL families
PostgreSQL (analytics-friendly, standards-first), MySQL/MariaDB (web & OLTP), SQL Server (enterprise), SQLite (embedded/file-based).
In this course page we use SQLite for demos. You can reuse the same queries on other engines with minimal tweaks.
2) Big picture: DDL vs DML (and friends)
DDL — Data Definition Language
Defines schema (tables, columns, constraints, indexes).
CREATE TABLE— make a tableALTER TABLE— modify structureDROP TABLE— delete table
DML — Data Manipulation Language
Reads/writes data inside tables.
SELECT— query rowsINSERT— add rowsUPDATE— change rowsDELETE— remove rows
Other categories
- TCL (transactions):
BEGIN,COMMIT,ROLLBACK - DCL (security):
GRANT,REVOKE
Minimal mental model
-- Define your structure first (DDL)
CREATE TABLE Employees (
EmployeeID INTEGER PRIMARY KEY,
FirstName TEXT NOT NULL,
LastName TEXT NOT NULL,
Department TEXT,
Salary REAL
);
-- Then put data (DML)
INSERT INTO Employees VALUES (101, 'Ravi', 'Kumar', 'IT', 60000);
-- Then query it (DML)
SELECT FirstName, Department, Salary
FROM Employees
WHERE Salary >= 60000
ORDER BY Salary DESC;
3) The demo database we’ll use on this page
To keep the practice smooth and persistent across requests, your Flask backend
is wired to a local SQLite database file (dspython.db). We pre-load two tables:
Employees and Projects. You’ll use them in later lessons
(basic queries, aggregates, joins, subqueries…).
Employees
EmployeeID— integer primary keyFirstName,LastName— textDepartment— text (e.g., IT, HR, Finance)Salary— real number
Projects
ProjectID— integer primary keyProjectName— textDepartment— text (owner)Budget— real number
SELECT * FROM Employees; or SELECT * FROM Projects;
4) Keys & Constraints: building correct tables
Constraints are rules the database enforces to keep your data correct. Good constraints reduce bugs, simplify analysis, and improve query performance by enabling better indexes.
Primary Key (PK)
Uniquely identifies each row. Usually a single integer column or a stable business key.
CREATE TABLE Departments (
DeptID INTEGER PRIMARY KEY, -- PK
DeptName TEXT UNIQUE NOT NULL -- also unique
);
In SQLite, INTEGER PRIMARY KEY auto-rows behave like an alias for the internal rowid.
Foreign Key (FK)
Links child rows to a parent table PK. Enforces referential integrity.
CREATE TABLE Employees (
EmployeeID INTEGER PRIMARY KEY,
FirstName TEXT NOT NULL,
LastName TEXT NOT NULL,
DeptID INTEGER,
Salary REAL CHECK (Salary >= 0),
FOREIGN KEY (DeptID) REFERENCES Departments(DeptID)
ON UPDATE CASCADE
ON DELETE SET NULL
);
ON UPDATE/DELETE actions (CASCADE, RESTRICT, SET NULL) are crucial for lifecycle rules.
UNIQUE
Ensures no duplicate values across one or more columns.
CREATE TABLE Users (
UserID INTEGER PRIMARY KEY,
Email TEXT NOT NULL UNIQUE,
Username TEXT NOT NULL UNIQUE
);
NOT NULL
Column must always have a value; avoids “unknowns” where they don’t make sense.
CREATE TABLE Products (
ProductID INTEGER PRIMARY KEY,
Name TEXT NOT NULL,
Price REAL NOT NULL CHECK (Price >= 0)
);
CHECK
Custom rule. Great for simple domain constraints.
CREATE TABLE Ratings (
ReviewID INTEGER PRIMARY KEY,
Stars INTEGER CHECK (Stars BETWEEN 1 AND 5)
);
CREATE INDEX to speed reads (careful: indexes slow inserts).
5) Data types, NULLs, and three-valued logic
SQL engines have rich type systems. SQLite is more dynamic (type affinity), but the principles carry across engines. Understanding NULL is essential: it means “unknown/absent,” and it propagates through expressions.
Common types
- INTEGER/INT — whole numbers
- REAL/DECIMAL — floating or fixed-precision
- TEXT/VARCHAR — strings
- DATE/TIME/TIMESTAMP — temporal
- BLOB — binary
NULL behavior
Comparisons with NULL are not true/false—they’re unknown.
SELECT 1 = NULL; -- NULL (unknown)
SELECT 1 < NULL; -- NULL
SELECT NULL IS NULL; -- TRUE
SELECT NULL = NULL; -- NULL (not TRUE)
Guarding against NULL
Use COALESCE to substitute defaults; use IS NULL/IS NOT NULL in predicates.
SELECT FirstName, COALESCE(Department, 'Unknown') AS Dept
FROM Employees
WHERE Department IS NULL OR Salary IS NOT NULL;
AVG(NULL, 5) = 5), but COUNT(*) counts all rows
whereas COUNT(column) counts only non-NULL values in that column.
6) Normalization: keep data tidy, avoid anomalies
Normal forms help you organize tables to reduce duplication and update anomalies. Don’t memorize every rule— understand the spirit: “one fact in one place.”
1NF — Atomic columns
No repeating groups or arrays; each cell holds a single value.
-- Anti-pattern: comma-separated skills in one column
-- Better: a separate Skills table linked by EmployeeID
2NF — No partial dependencies
If the PK is composite, each non-key attribute depends on the whole key.
-- Orders( OrderID, ProductID, ProductName ) -- ProductName depends only on ProductID
-- Move ProductName to Products, keep Orders(OrderID, ProductID, ...)
3NF — No transitive dependencies
Non-key attributes shouldn’t depend on other non-keys.
-- Employees( EmpID, DeptID, DeptName )
-- DeptName depends on DeptID, not EmpID → split to Departments table
7) How SQL runs your query (logical order & planning)
SQL is declarative—you state what you want, the engine decides how. Knowing the logical order helps you predict results and debug.
Logical evaluation order
FROM(+JOIN) — build row sourceWHERE— filter rows (no aggregates yet)GROUP BY— make groupsHAVING— filter groups (after aggregates)SELECT— compute expressions/aliasesORDER BY— sort final rowsLIMIT/OFFSET— trim results
SELECT Department, AVG(Salary) AS AvgPay
FROM Employees
WHERE Salary > 0
GROUP BY Department
HAVING AVG(Salary) >= 60000
ORDER BY AvgPay DESC
LIMIT 5;
EXPLAIN (ANALYZE); in SQLite, EXPLAIN QUERY PLAN ... to reason about performance.
8) Writing readable SQL: style & maintainability
Style checklist
- Uppercase keywords (
SELECT,FROM,WHERE). - One clause per line; indent nested queries.
- Use
ASfor clear aliases; avoid single-letter aliases unless short-lived. - Prefer
JOIN ... ONover legacy comma joins. - Comment tricky logic (
-- why, not just-- what).
Before vs After
-- ❌ Hard to read
select e.firstname, e.lastname, p.projectname
from employees e, projects p
where e.department = p.department and e.salary > 60000;
-- ✅ Clear & maintainable
SELECT
e.FirstName,
e.LastName,
p.ProjectName
FROM Employees AS e
JOIN Projects AS p
ON e.Department = p.Department
WHERE e.Salary > 60000;
9) Transactions & safety: don’t lose data
Use transactions to group changes. If anything fails, ROLLBACK keeps your database consistent.
BEGIN; -- or START TRANSACTION
UPDATE Employees
SET Salary = Salary * 1.05
WHERE Department = 'IT';
INSERT INTO Projects (ProjectID, ProjectName, Department, Budget)
VALUES (10, 'Cost Optimization', 'IT', 15000);
-- If everything looks good:
COMMIT;
-- If something went wrong:
-- ROLLBACK;
10) Hands-on Practice: CREATE TABLE
Let’s practice DDL. Create a brand-new table named Customers with:
CustomerID— INTEGER PRIMARY KEYCustomerName— TEXT, cannot be NULL- Optional:
Email(TEXT UNIQUE)
After successful
CREATE, try:
SELECT * FROM Customers;
to view the current records (initially empty).