Day 3: Word Embedding
As part of my #75DaysOfLLM journey, today we’re diving into Word Embeddings.
What is Word Embedding?
Word embedding is a technique in natural language processing (NLP) that represents words as dense vectors of real numbers. Instead of treating words as discrete symbols, word embedding allows us to capture the meaning and relationships between words in a continuous vector space.
Key points:
- Words are represented as vectors (lists of numbers)
- These vectors typically have 100-300 dimensions
- Similar words have similar vector representations
- The vectors capture semantic and syntactic information about words
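The claim that "similar words have similar vector representations" is usually measured with cosine similarity. A toy illustration with made-up 3-dimensional vectors (real embeddings are learned from data and have far more dimensions):

```python
import math

# Toy 3-dimensional vectors -- illustrative values only; real embeddings
# are learned from text and typically have 100-300 dimensions.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Related words score close to 1; unrelated words score lower.
print(cosine_similarity(vectors["king"], vectors["queen"]))
print(cosine_similarity(vectors["king"], vectors["apple"]))
```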
Why is Word Embedding Important?
Capturing word relationships: Word embeddings can represent complex relationships between words, such as analogies (e.g., king - man + woman ≈ queen).
Dimensionality reduction: Instead of using one-hot encoding (which would result in extremely large, sparse vectors), word embeddings provide a dense, low-dimensional representation.
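The contrast can be sketched with a toy 5-word vocabulary (purely illustrative): a one-hot vector needs one dimension per vocabulary word, while a dense embedding packs meaning into a few real-valued dimensions.

```python
vocab = ["king", "queen", "apple", "banana", "run"]

# One-hot: one dimension per vocabulary word -- sparse, and the vector
# length grows with the vocabulary (real vocabularies reach tens of
# thousands of words, so these vectors become enormous).
def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

print(one_hot("queen"))  # [0, 1, 0, 0, 0]

# A dense embedding for the same word would be a short list of
# real numbers (e.g. 100-300 values), with similar words receiving
# similar values instead of orthogonal spikes.
```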
Improved performance: Word embeddings have been shown to improve performance in various NLP tasks, including:
(All of the tasks below were introduced in Day 1 of this series.)
- Text classification
- Named entity recognition
- Machine translation
- Sentiment analysis
Transfer learning: Pre-trained word embeddings can be used as input features for other NLP models, allowing knowledge transfer between tasks.
Handling out-of-vocabulary words: Some embedding techniques can generate representations for words not seen during training.
How Does Word Embedding Work?
Word embedding algorithms typically work by analyzing large corpora of text and learning vector representations based on the contexts in which words appear. The underlying principle is the distributional hypothesis, which states that words that occur in similar contexts tend to have similar meanings.
The process generally involves:
- Defining a context window (e.g., 5 words before and after the target word)
- Scanning through the text corpus
- Applying a learning algorithm to adjust word vectors based on observed contexts
- Iterating until convergence or a specified number of epochs
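The scanning step above can be sketched as a simple sliding window that produces (target, context) training pairs. This is a minimal illustration of the data-preparation stage, not a full training loop:

```python
def context_pairs(tokens, window=2):
    """Yield (target, context_word) pairs within a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the target word itself
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
print(context_pairs(tokens, window=1))
```

A learning algorithm such as Word2Vec then adjusts the vectors so that words sharing many context words end up close together.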
The resulting vector space has interesting properties:
- Words with similar meanings cluster together
- Vector arithmetic can reveal semantic relationships (e.g., vec("king") - vec("man") + vec("woman") ≈ vec("queen"))
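The arithmetic can be illustrated with hand-picked toy vectors. The dimensions and values below are invented for illustration (dim 0 roughly "royalty", dim 1 roughly "femaleness"); real embeddings learn such structure from data:

```python
import numpy as np

# Toy 2-dimensional embeddings with invented, hand-picked values.
emb = {
    "king":  np.array([1.0, 0.0]),
    "queen": np.array([1.0, 1.0]),
    "man":   np.array([0.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "apple": np.array([-1.0, 0.5]),
}

def nearest(vec, exclude):
    """Return the word whose vector is closest (Euclidean) to vec."""
    candidates = {w: v for w, v in emb.items() if w not in exclude}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - vec))

result = emb["king"] - emb["man"] + emb["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))  # queen
```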
Popular Word Embedding Techniques
1. Word2Vec
Developed by researchers at Google, Word2Vec uses a shallow, two-layer neural network to learn word embeddings. It comes in two flavors:
a) Continuous Bag of Words (CBOW):
- Predicts a target word given its context words
- Faster to train and better for frequent words
b) Skip-gram:
- Predicts context words given a target word
- Works well with small datasets and rare words
Python Example using Gensim:
```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [['I', 'love', 'natural', 'language', 'processing'],
             ['Word', 'embedding', 'is', 'fascinating'],
             ['Machine', 'learning', 'is', 'the', 'future']]

# Train the model (sg=0 selects CBOW)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

# Get the vector for a word
vector = model.wv['language']

# Find similar words
similar_words = model.wv.most_similar('language', topn=3)
print(similar_words)
```
Parameters explained:
- `vector_size`: Dimensionality of the word vectors (default is 100)
- `window`: Maximum distance between the current and predicted word (default is 5)
- `min_count`: Ignores words with total frequency below this threshold
- `sg`: Training algorithm: 1 for skip-gram; 0 for CBOW (default)
2. GloVe (Global Vectors for Word Representation)
GloVe, developed at Stanford, learns word vectors by analyzing global word-word co-occurrence statistics from a corpus.
Key features:
- Combines the advantages of local context window methods and global matrix factorization
- Often performs well on word analogy tasks
- Efficient to train on large corpora
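Pre-trained GloVe vectors are distributed as plain text files with one word followed by its values per line. A minimal loader sketch; the inline sample here stands in for a real file such as `glove.6B.100d.txt`:

```python
import io
import numpy as np

# GloVe releases use a plain-text format: "word v1 v2 ... vd" per line.
# A tiny inline sample replaces a real (multi-hundred-MB) vector file.
sample = """\
the 0.1 0.2 0.3
cat 0.4 0.5 0.6
"""

def load_glove(fileobj):
    """Parse GloVe-format lines into a dict of word -> numpy vector."""
    embeddings = {}
    for line in fileobj:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return embeddings

vectors = load_glove(io.StringIO(sample))
print(vectors["cat"])  # [0.4 0.5 0.6]
```

With a real file you would pass `open("glove.6B.100d.txt", encoding="utf-8")` instead of the `StringIO` sample.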
3. fastText
Created by Facebook's AI Research lab, fastText extends the Word2Vec model by treating each word as composed of character n-grams.
Advantages:
- Can generate embeddings for out-of-vocabulary words
- Performs well for morphologically rich languages
- Captures subword information
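The n-gram idea can be sketched in a few lines: fastText pads each word with boundary markers `<` and `>` and collects its character n-grams (typically n = 3 to 6), so a word's vector is built from the vectors of its pieces:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with fastText-style boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

# The 3-grams of "where", including the boundary-marked '<wh' and 're>'
print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because unseen words still share n-grams with seen words, fastText can assemble a vector for them where Word2Vec and GloVe cannot.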
Python Example using Gensim:
```python
from gensim.models import FastText

# Toy corpus (same style as the Word2Vec example)
sentences = [['I', 'love', 'natural', 'language', 'processing'],
             ['Word', 'embedding', 'is', 'fascinating'],
             ['Machine', 'learning', 'is', 'the', 'future']]

# Train the model
model = FastText(sentences, vector_size=100, window=5, min_count=1, sg=0)

# Get the vector for a word
vector = model.wv['language']

# Subword information lets fastText query words never seen in
# training, such as 'unpredictable'
similar_words = model.wv.most_similar('unpredictable', topn=3)
print(similar_words)
```
Differences Between These Techniques
Training approach:
- Word2Vec: Uses local context windows
- GloVe: Uses global word-word co-occurrence statistics
- fastText: Uses character n-grams together with local context windows

Handling of rare/unseen words:
- Word2Vec: Struggles with rare words and cannot handle unseen words
- GloVe: Better with rare words due to global statistics, but cannot handle unseen words
- fastText: Handles both rare and unseen words well thanks to subword information

Training speed:
- Word2Vec: Fast
- GloVe: Generally slower than Word2Vec
- fastText: Similar to Word2Vec, but can be slower due to subword processing

Performance on different tasks:
- Word2Vec: Good all-around performance
- GloVe: Often excels at analogy tasks
- fastText: Performs well on morphologically rich languages and tasks requiring subword information
Conclusion
Word embedding has revolutionized many NLP tasks by providing rich, dense representations of words that capture semantic and syntactic information. By transforming words into numerical vectors, we enable machines to process and understand language in ways that more closely resemble human comprehension.
Each embedding technique (Word2Vec, GloVe, and fastText) has its strengths and is suited for different types of tasks or languages. As you work on NLP projects, experimenting with different embedding techniques can often lead to significant improvements in model performance.
The field of word embedding continues to evolve, with more recent developments including contextualized word embeddings (like ELMo and BERT) that generate different word vectors based on the surrounding context. These advancements promise even more sophisticated language understanding capabilities for AI systems in the future.