From Chaos to Clarity: The Journey of Text Cleaning in NLP (Part 2)
We’ve already taken our first steps in cleaning messy text—lowercasing, tokenization, punctuation removal, and filtering stopwords. Now, let’s dig deeper and refine our text even more with techniques that help AI truly understand meaning and structure.
Just as a librarian doesn’t just organize books but also ensures they have correct titles and summaries, we must refine our text further for **optimal machine understanding**.
6️⃣ Stemming: Reducing Words to Their Root Form
📌 Problem: Words like "running," "runs," and "ran" are all forms of the word "run", but an NLP model might treat them as separate entities, increasing complexity.
💡 Solution: Stemming chops words down to their root form using rule-based reductions.
```python
from nltk.stem import PorterStemmer

# Porter stemmer: rule-based suffix stripping
ps = PorterStemmer()
words = ["running", "flies", "easily", "loving"]
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)
```
🔹 Before: ["running", "flies", "easily", "loving"]
🔹 After: ["run", "fli", "easili", "love"]
👉 Why is this useful?
- Reduces vocabulary size, improving computational efficiency.
- Helps NLP models group similar words together.
🔸 Limitations: Stemming is a crude process that doesn’t consider actual word meanings (e.g., "flies" became "fli"). That’s where lemmatization comes in!
7️⃣ Lemmatization: Converting Words to Their Dictionary Base Form
📌 Problem: Stemming blindly cuts words, sometimes producing meaningless roots. Instead, we need a smarter way to map words to their **actual base form**.
💡 Solution: Lemmatization converts words to their dictionary base form using linguistic rules.
```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')   # WordNet dictionary used for the lookups
nltk.download('omw-1.4')   # some newer NLTK versions also need this

lemmatizer = WordNetLemmatizer()
words = ["running", "flies", "easily", "better"]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
```
🔹 Before: ["running", "flies", "easily", "better"]
🔹 After: ["running", "fly", "easily", "better"]
👉 Why is this useful?
- More accurate than stemming (e.g., "flies" becomes "fly", not "fli").
- Keeps words intelligible and meaningful.
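By default, `WordNetLemmatizer` treats every word as a noun, which is why "running" and "better" pass through unchanged above. Supplying a part-of-speech hint changes the result; a minimal sketch using the same lemmatizer:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# pos='v' (verb) and pos='a' (adjective) guide the dictionary lookup
print(lemmatizer.lemmatize("running", pos='v'))  # run
print(lemmatizer.lemmatize("better", pos='a'))   # good
```

In practice, the POS tags usually come from a tagger rather than being hard-coded, which is one reason POS tagging (previewed at the end of this post) pairs naturally with lemmatization.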
8️⃣ Removing Numbers: Filtering Out Non-Useful Digits
📌 Problem: Many texts contain numbers that aren't useful for NLP tasks. Consider:
- "The price is 50 dollars" → Here, 50 is important.
- "Call me at 9876543210" → The number isn't useful for NLP processing.
💡 Solution: Remove numbers only when they don't add meaning.
```python
import re

text = "The AI model trained on 50000 samples in 2023."
# \s*\d+ also swallows the space before each number, avoiding stray double spaces
clean_text = re.sub(r'\s*\d+', '', text)
print(clean_text)
```
🔹 Before: "The AI model trained on 50000 samples in 2023."
🔹 After: "The AI model trained on samples in."
👉 Why is this useful?
- Helps models focus on actual language instead of numerical noise.
- Reduces irrelevant variations in text.
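When numbers do carry meaning (as in "The price is 50 dollars"), a common alternative is to replace them with a placeholder token instead of deleting them. A minimal sketch; the `<NUM>` token name is just an illustrative convention:

```python
import re

text = "The price is 50 dollars, call me at 9876543210."
# Swap each digit run for a placeholder so the model still sees "a number was here"
placeholder_text = re.sub(r'\d+', '<NUM>', text)
print(placeholder_text)
# The price is <NUM> dollars, call me at <NUM>.
```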
9️⃣ Handling Contractions: Expanding Words for Better Understanding
📌 Problem: Text in conversations and social media often contains contractions like:
- "I'm" → "I am"
- "won't" → "will not"
- "they've" → "they have"
💡 Solution: Expand contractions into full words for clarity.
```python
import contractions  # pip install contractions

text = "I'm learning NLP because it's awesome!"
expanded_text = contractions.fix(text)
print(expanded_text)
```
🔹 Before: "I'm learning NLP because it's awesome!"
🔹 After: "I am learning NLP because it is awesome!"
👉 Why is this useful?
- Improves text clarity by converting informal language into standard English.
- Helps NLP models understand intent and meaning better.
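If you’d rather not add a dependency, a small lookup table covers the common cases. A sketch with a deliberately tiny, illustrative mapping:

```python
import re

# Deliberately small mapping; a production table covers many more forms
CONTRACTION_MAP = {
    "i'm": "i am",
    "it's": "it is",
    "won't": "will not",
    "they've": "they have",
}

def expand_contractions(text):
    # Word boundaries stop us from rewriting substrings of longer words
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, CONTRACTION_MAP)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

print(expand_contractions("I'm learning NLP because it's awesome!"))
# i am learning NLP because it is awesome!
# (case folds to lowercase, which is fine if you lowercase the text anyway)
```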
🔟 Removing Special Characters: Eliminating Unnecessary Symbols
📌 Problem: Text often contains special symbols like @, #, $, %, and &, which usually add noise rather than meaning for the task at hand.
💡 Solution: Strip special characters while keeping meaningful text intact.
```python
import re

text = "Hello @world! This is #NLP with $100% efficiency."
# Keep only letters, digits, and whitespace
clean_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
print(clean_text)
```
🔹 Before: "Hello @world! This is #NLP with $100% efficiency."
🔹 After: "Hello world This is NLP with 100 efficiency"
👉 Why is this useful?
- Removes unnecessary noise from text.
- Focuses on words that add value to NLP models.
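The character class is easy to tune for your task. If hashtags and mentions carry signal, as they often do in social-media text, keep `#` and `@` in the allowed set; a sketch of that variant:

```python
import re

text = "Hello @world! This is #NLP with $100% efficiency."
# Same idea, but # and @ survive so social tokens stay intact
clean_text = re.sub(r'[^a-zA-Z0-9\s#@]', '', text)
print(clean_text)
# Hello @world This is #NLP with 100 efficiency
```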
Bringing It All Together: The Power of Preprocessing
Each of these text-cleaning steps plays a critical role in preparing data for machine learning models.
✅ Lowercasing ensures uniformity.
✅ Tokenization breaks text into meaningful chunks.
✅ Punctuation & stopword removal reduce noise.
✅ Stemming & lemmatization standardize words for better comprehension.
✅ Handling contractions, numbers, and special characters refines the text further.
🔹 Before Preprocessing:
"I'm LOVING NLP! Running & learning AI with 50,000 samples is fun!! 😊"
🔹 After Preprocessing:
"I am love NLP run learn AI sample fun"
🚀 What’s Next?
With clean text, we can now move into advanced NLP techniques, such as **POS tagging, Named Entity Recognition (NER), vectorization, and deep-learning-based embeddings**.
📌 Want to go deeper? Stay tuned for the next part of our NLP journey where we transform cleaned text into structured machine-readable data! 🚀