DSPython

Natural Language Processing Basics

Learn to convert text into numbers and build classification models with Bag-of-Words and TF-IDF.

NLP · Intermediate · 75 min

Topic 1: What is NLP?

Natural Language Processing (NLP) is a field of machine learning and artificial intelligence that deals with the interaction between computers and human (natural) language. The ultimate goal is to enable computers to "understand," interpret, and generate human language in a valuable way.

This is hard because human language is unstructured. Unlike a spreadsheet with neat columns (`Age`, `Price`), text is a messy sequence of words, full of context, sarcasm, and irregular grammar. Our first job is to give it structure.



Examples of NLP include spam filters, machine translation (Google Translate), virtual assistants (Siri, Alexa), and sentiment analysis (is this movie review positive or negative?).


Topic 2: The Core Problem: Text to Numbers

Machine learning models (like Logistic Regression or Naive Bayes) cannot understand the word "car". They only understand numbers.

The first step in any NLP pipeline is **vectorization**: the process of converting a collection of text documents into numerical feature vectors. This involves creating a "vocabulary" of all the unique words in our dataset and then representing each document based on the words it contains.
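
To make this concrete, here is a minimal from-scratch sketch of the idea in plain Python (the two toy sentences are invented for illustration). scikit-learn automates exactly this in the next topic:

docs = ['the cat sat', 'the dog sat on the mat']

# Step 1: build a vocabulary of every unique word across all documents
vocabulary = sorted({word for doc in docs for word in doc.split()})
# ['cat', 'dog', 'mat', 'on', 'sat', 'the']

# Step 2: represent each document as a vector of word counts over that vocabulary
vectors = [[doc.split().count(word) for word in vocabulary] for doc in docs]
# [[1, 0, 0, 0, 1, 1],    'the cat sat'
#  [0, 1, 1, 1, 1, 2]]    'the dog sat on the mat'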


Topic 3: Bag-of-Words (CountVectorizer)

The simplest and most common vectorization method is the **Bag-of-Words (BoW)** model. It's called this because it "forgets" all grammar and word order, treating each document as just a "bag" containing a jumble of words.



It works in two steps:

  1. Build Vocabulary: Scan all documents and find every unique word. This becomes your vocabulary (your feature list).
  2. Count: For each document, count how many times each word from the vocabulary appears in it.

We use `sklearn.feature_extraction.text.CountVectorizer` for this.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.'
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # learn the vocabulary, then count each word per document

print(vectorizer.get_feature_names_out())  # the learned vocabulary (one column per word)
print(X.toarray())                         # the counts as a dense array

# X is a sparse matrix (a "document-term matrix"):
#          document  first  is  second  the  this
# Doc 1:      1        1     1      0     1     1
# Doc 2:      2        0     1      1     1     1
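
Once fitted, the same vectorizer maps new text onto the same columns. A small usage sketch (the sentence below is made up; note that we call transform, not fit_transform, so the learned vocabulary is reused and unseen words are ignored):

new_docs = ['Is this the first document or the third document?']
X_new = vectorizer.transform(new_docs)  # reuse the fitted vocabulary; do not fit again

print(X_new.toarray())
# [[2 1 1 0 2 1]]  -> document x2, first x1, is x1, second x0, the x2, this x1
# "or" and "third" are not in the vocabulary, so they are silently dropped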


Topic 4: TF-IDF (TfidfVectorizer)

A problem with Bag-of-Words is that it treats all words equally. The word "the" will get a very high count, but it's useless for telling documents apart. The word "rocket" might only appear once, but it's *very* informative.

TF-IDF (Term Frequency - Inverse Document Frequency) solves this. It's a numerical statistic that reflects how *important* a word is to a document in a collection.

TF-IDF is the product of two numbers:

  • Term Frequency (TF): How often a word appears in *one document*. This is just like Bag-of-Words. (e.g., "rocket" appears 5 times in Doc 1).
  • Inverse Document Frequency (IDF): A score for how *rare* a word is across *all documents*.
    Words that are everywhere (like "the", "a") get an IDF score close to 0.
    Words that are rare (like "rocket", "nebula") get a high IDF score.

When you multiply them ($TF \times IDF$), you get the final TF-IDF score. Words that are frequent in *one* document (high TF) but rare *overall* (high IDF) get the highest scores.
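
In its textbook form (scikit-learn actually uses a smoothed variant, but the idea is the same), the score for a term $t$ in a document $d$, with $N$ documents in total, is:

$$\text{tfidf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$$

where $\text{df}(t)$ is the number of documents containing $t$. A word that appears in every document gets $\log(N/N) = 0$ and contributes nothing, while a word that appears in only one document gets the maximum boost of $\log N$.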

We use `sklearn.feature_extraction.text.TfidfVectorizer`. This is the standard, go-to vectorizer for most text classification tasks.
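
A minimal sketch using the same two-sentence corpus from Topic 3 (the exact numbers depend on scikit-learn's smoothing and L2 normalization, so treat the output as illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.'
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # same fit/transform API as CountVectorizer

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
# In Doc 1, "first" and "is" each appear once, but "first" occurs in only one
# document of the corpus, so it receives the higher TF-IDF weight.

The resulting matrix X can be fed straight into a classifier such as Logistic Regression or Naive Bayes, just like any other numerical feature matrix.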