Day 2: Text Preprocessing for NLP

As part of my #75DaysOfLLM journey, we’re diving into Text Preprocessing. Text preprocessing transforms raw text into clean, structured data for machines to analyze. In this post, we’ll explore the steps involved in preprocessing text, from cleaning to tokenization, stop word removal, and more.

Text Cleaning

Text often contains unwanted elements like HTML tags, punctuation, numbers, and special characters that don’t add value. Cleaning the text involves removing these elements to reduce noise and focus on meaningful content.

Sample Code:

import re

# Sample text
text = "Hello! This is <b>sample</b> text with numbers (1234) and punctuation!!"

# Removing HTML tags and special characters
cleaned_text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
cleaned_text = re.sub(r'[^a-zA-Z\s]', '', cleaned_text)  # Remove punctuation/numbers
print(cleaned_text)

Tokenization

Tokenization is the process of breaking down text into smaller units, usually words or sentences. This step allows us to work with individual words (tokens) rather than a continuous stream of text. Each token serves as a unit of meaning that the NLP model can understand.

In tokenization, there are two common approaches:

Word Tokenization: Splits the text into individual words.
Sentence Tokenization: Splits the text into sentences, which is useful for certain applications like text summarization.

Sample Code:

from nltk.tokenize import word_tokenize

# Tokenization
tokens = word_tokenize(cleaned_text)
print(tokens)

Stop Word Removal

Stop words are common words (like "the", "is", "and") that appear frequently but don’t contribute much meaning. Removing them reduces the dataset size and focuses on more important words.

Sample Code:

from nltk.corpus import stopwords

# Removing stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token not in stop_words]
print(filtered_tokens)

Stemming and Lemmatization

Both stemming and lemmatization reduce words to their base form, helping to standardize different forms of the same word (e.g., "running" and "ran"). Stemming is faster but less accurate, while lemmatization returns valid words based on vocabulary.

Stemming: Stemming involves chopping off word endings to get to the root form, without worrying about whether the resulting word is valid. It’s fast but less accurate.
Lemmatization: Lemmatization reduces words to their base form, but it uses vocabulary and morphological analysis to return valid words. This makes it more accurate than stemming.

Common Algorithms for Stemming and Lemmatization:

Porter Stemmer: One of the most popular stemming algorithms.
Lancaster Stemmer: A more aggressive stemmer that may truncate words more drastically.
WordNet Lemmatizer: Part of the NLTK library, it uses a dictionary to find the correct lemma of a word.

Sample Code:

from nltk.stem import PorterStemmer, WordNetLemmatizer

# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print("Stemmed Tokens:", stemmed_tokens)

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
print("Lemmatized Tokens:", lemmatized_tokens)

Expanding Contractions

Contractions are shortened versions of word groups, such as “can’t” for “cannot” or “I’m” for “I am.” While contractions are natural in everyday language, it’s often useful to expand them during text preprocessing to maintain consistency.

Sample Code:

import contractions

# Expanding contractions
expanded_text = contractions.fix("I can't do this right now.")
print(expanded_text)

Spell Check

Spelling errors in text can affect NLP models. Automatic spell check tools can detect and correct common misspellings.

Sample Code:

from textblob import TextBlob

# Spell check
text_with_typos = "Ths is an exmple of txt with speling errors."
corrected_text = str(TextBlob(text_with_typos).correct())
print(corrected_text)

Conclusion

Preprocessing is a vital first step in any NLP pipeline. Cleaning, tokenizing, removing stop words, and handling tasks like stemming and spell checking ensure that your text data is ready for analysis. Stay tuned for the next part of my #75DaysOfLLM challenge as we dive deeper into NLP and language models!