Detailed Introduction to Word Embedding

Naresh Nishad - Sep 22 - Dev Community

Day 3: Word Embedding

As part of my #75DaysOfLLM journey, today we’re diving into Word Embeddings.

What is Word Embedding?

Word embedding is a technique in natural language processing (NLP) that represents words as dense vectors of real numbers. Instead of treating words as discrete symbols, word embedding allows us to capture the meaning and relationships between words in a continuous vector space.

Key points:

  • Words are represented as vectors (lists of numbers)
  • These vectors typically have 100-300 dimensions
  • Similar words have similar vector representations
  • The vectors capture semantic and syntactic information about words
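To make "similar words have similar vector representations" concrete, here is a minimal sketch that compares vectors with cosine similarity. The 4-dimensional vectors are invented for illustration (real embeddings are learned from data and are much longer), and `cosine_similarity` is a helper defined here, not a library function.

```python
import math

# Toy 4-dimensional vectors, invented for illustration only.
# Real embeddings are learned from data and usually have 100-300 dimensions.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

With these made-up values, "cat" scores much closer to "dog" than to "car", which is the behavior a trained embedding is expected to show for related words.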

[Figure: Word embedding visualization]

Why is Word Embedding Important?

  1. Capturing word relationships: Word embeddings can represent complex relationships between words, such as analogies (e.g., king - man + woman ≈ queen).

  2. Dimensionality reduction: One-hot encoding needs one dimension per vocabulary word, which produces extremely large, sparse vectors. Word embeddings provide a dense, low-dimensional representation instead.

  3. Improved performance: Word embeddings have been shown to improve performance in various NLP tasks (all covered in Day 1 of this series), including:

    • Text classification
    • Named entity recognition
    • Machine translation
    • Sentiment analysis
  4. Transfer learning: Pre-trained word embeddings can be used as input features for other NLP models, allowing knowledge transfer between tasks.

  5. Handling out-of-vocabulary words: Some embedding techniques can generate representations for words not seen during training.
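The contrast with one-hot encoding can be sketched in a few lines. The five-word vocabulary and the helpers `one_hot` and `dot` are assumptions made for this example only.

```python
vocabulary = ["king", "queen", "man", "woman", "apple"]

def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's index in the vocabulary."""
    return [1 if w == word else 0 for w in vocab]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

king = one_hot("king", vocabulary)
queen = one_hot("queen", vocabulary)

# The vector length equals the vocabulary size, so it grows with every new word.
print(len(king))         # 5
# Any two distinct one-hot vectors are orthogonal: no notion of similarity.
print(dot(king, queen))  # 0
```

A dense embedding keeps a fixed length no matter how large the vocabulary gets, and related words can have a non-zero similarity.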

How Does Word Embedding Work?

Word embedding algorithms typically work by analyzing large corpora of text and learning vector representations based on the contexts in which words appear. The underlying principle is the distributional hypothesis, which states that words that occur in similar contexts tend to have similar meanings.

The process generally involves:

  1. Defining a context window (e.g., 5 words before and after the target word)
  2. Scanning through the text corpus
  3. Applying a learning algorithm to adjust word vectors based on observed contexts
  4. Iterating until convergence or a specified number of epochs
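The first two steps (defining a window and scanning the corpus) can be sketched as follows. This is only an illustration of the distributional hypothesis on a made-up two-sentence corpus; the function name `context_counts` is an assumption, and the learning steps (3 and 4) are not shown.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, already tokenized.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]

def context_counts(sentences, window=2):
    """For each word, count the words that appear within `window` positions of it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # skip the target word itself
                    counts[target][tokens[j]] += 1
    return counts

counts = context_counts(corpus, window=2)

# "cat" and "dog" never co-occur, yet they share the same context words.
shared = set(counts["cat"]) & set(counts["dog"])
print(sorted(shared))  # ['on', 'sat', 'the']
```

Because "cat" and "dog" appear in the same contexts, a learning algorithm that adjusts vectors based on these observations would tend to pull their vectors together.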

The resulting vector space has interesting properties:

  • Words with similar meanings cluster together
  • Vector arithmetic can reveal semantic relationships (e.g., vec("king") - vec("man") + vec("woman") ≈ vec("queen"))
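The analogy arithmetic can be tried on hand-made vectors. Everything in this sketch is hypothetical: the 2-dimensional vectors are chosen so that one axis loosely means "royalty" and the other "maleness", and `analogy` is a helper written for this example. Trained embeddings only approximate this behavior.

```python
import math

# Hypothetical 2-D vectors: first axis ~ "royalty", second axis ~ "maleness".
vectors = {
    "king":   [0.9, 0.9],
    "queen":  [0.9, 0.1],
    "man":    [0.1, 0.9],
    "woman":  [0.1, 0.1],
    "prince": [0.8, 0.8],
    "apple":  [0.0, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

best = analogy("king", "man", "woman", vectors)
print(best)  # queen
```

Excluding the three input words from the candidates mirrors the usual convention for analogy queries; otherwise one of the inputs can end up as the nearest neighbor.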

Popular Word Embedding Techniques

1. Word2Vec

Developed by researchers at Google, Word2Vec uses shallow two-layer neural networks to learn word embeddings. It comes in two flavors:

a) Continuous Bag of Words (CBOW):

  • Predicts a target word given its context words
  • Faster to train and better for frequent words

b) Skip-gram:

  • Predicts context words given a target word
  • Works well with small datasets and rare words
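The difference between the two flavors is easiest to see in the training examples each one builds from the same sentence. The sketch below only constructs those examples; `training_examples` is a hypothetical helper, and real Word2Vec implementations add further details (such as subsampling of frequent words and negative sampling) that are left out here.

```python
def training_examples(tokens, window=2):
    """Build CBOW and skip-gram training examples from one tokenized sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # CBOW: all context words together are the input, the target is the label.
        cbow.append((context, target))
        # Skip-gram: the target is the input, each context word is a separate label.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram

tokens = ["word", "embedding", "is", "fascinating"]
cbow, skipgram = training_examples(tokens, window=1)

print(cbow[1])       # (['word', 'is'], 'embedding')
print(skipgram[:3])  # [('word', 'embedding'), ('embedding', 'word'), ('embedding', 'is')]
```

CBOW produces one example per position, while skip-gram produces one per (target, context) pair, which is one reason skip-gram takes longer to train on the same text.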

Python Example using Gensim:



from gensim.models import Word2Vec

sentences = [['I', 'love', 'natural', 'language', 'processing'],
             ['Word', 'embedding', 'is', 'fascinating'],
             ['Machine', 'learning', 'is', 'the', 'future']]

# Train the model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

# Get the vector for a word
vector = model.wv['language']

# Find similar words (on a corpus this small, the neighbors are not meaningful)
similar_words = model.wv.most_similar('language', topn=3)
print(similar_words)



Parameters explained:

  • vector_size: Dimensionality of word vectors (default is 100)
  • window: Maximum distance between current and predicted word (default is 5)
  • min_count: Ignores words with total frequency below this threshold (default is 5; set to 1 here because the example corpus is tiny)
  • sg: Training algorithm: 1 for skip-gram; 0 for CBOW (default)

2. GloVe (Global Vectors for Word Representation)

GloVe, developed at Stanford, learns word vectors by analyzing global word-word co-occurrence statistics from a corpus.

Key features:

  • Combines the advantages of local context window methods and global matrix factorization
  • Often performs well on word analogy tasks
  • Efficient to train on large corpora
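For readers who want to see how the co-occurrence counts enter training, here is a small sketch of the per-pair term in the GloVe objective, f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2, with the weighting function and its defaults (x_max = 100, alpha = 0.75) as described in the GloVe paper. The function names and the sample vectors, biases, and count are assumptions made for this example; a real implementation sums this term over all non-zero co-occurrence pairs and optimizes the vectors.

```python
import math

def glove_weight(cooccurrence, x_max=100.0, alpha=0.75):
    """Weighting f(x): damps rare pairs and caps frequent ones at 1."""
    if cooccurrence < x_max:
        return (cooccurrence / x_max) ** alpha
    return 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, cooccurrence):
    """Weighted squared error for one word pair with co-occurrence count X_ij > 0."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    error = dot + b_i + b_j - math.log(cooccurrence)
    return glove_weight(cooccurrence) * error ** 2

# Hypothetical vectors, biases, and count, chosen only to exercise the functions.
loss = glove_pair_loss([0.5, 1.0], [1.0, 0.5], 0.1, 0.2, cooccurrence=10)
print(f"{loss:.4f}")
```

The weighting keeps very frequent pairs from dominating the objective while giving rare, noisy pairs less influence.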

3. fastText

Created by Facebook's AI Research lab, fastText extends the Word2Vec model by treating each word as composed of character n-grams.

Advantages:

  • Can generate embeddings for out-of-vocabulary words
  • Performs well for morphologically rich languages
  • Captures subword information
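The subword idea can be sketched by extracting character n-grams directly. The helper `char_ngrams` is written for this example; it uses the `<` and `>` boundary markers and the 3 to 6 character range that fastText uses by default, but leaves out details of the real library (for instance, it also keeps the whole word as its own unit and hashes n-grams into a fixed number of buckets).

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.add(wrapped[i:i + n])
    return ngrams

# Trigrams only, to keep the output short.
print(sorted(char_ngrams("where", min_n=3, max_n=3)))
# ['<wh', 'ere', 'her', 're>', 'whe']

# An unseen word still shares n-grams with words seen during training.
seen = char_ngrams("predict")
unseen = char_ngrams("unpredictable")
shared = seen & unseen
print("predic" in shared)  # True
```

Since a word's vector is built from its n-gram vectors, an out-of-vocabulary word like "unpredictable" can borrow information from n-grams learned on words such as "predict".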

Python Example using Gensim:



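from gensim.models import FastText

# Same toy corpus as in the Word2Vec example
sentences = [['I', 'love', 'natural', 'language', 'processing'],
             ['Word', 'embedding', 'is', 'fascinating'],
             ['Machine', 'learning', 'is', 'the', 'future']]

# Train the model
model = FastText(sentences, vector_size=100, window=5, min_count=1, sg=0)

# Get the vector for a word seen during training
vector = model.wv['language']

# 'unpredictable' is not in the training data; fastText still builds
# a vector for it from its character n-grams
similar_words = model.wv.most_similar('unpredictable', topn=3)
print(similar_words)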


Differences Between These Techniques

  1. Training approach:

    • Word2Vec: Uses local context windows
    • GloVe: Uses global co-occurrence statistics
    • fastText: Uses character n-grams and local context windows
  2. Handling of rare/unseen words:

    • Word2Vec: Struggles with rare words, can't handle unseen words
    • GloVe: Global statistics may help somewhat with rare words, but it can't handle unseen words
    • fastText: Handles both rare and unseen words well due to subword information
  3. Training speed:

    • Word2Vec: Fast
    • GloVe: Generally slower than Word2Vec
    • fastText: Similar to Word2Vec, but can be slower due to subword processing
  4. Performance on different tasks:

    • Word2Vec: Good all-around performance
    • GloVe: Often excels at analogy tasks
    • fastText: Performs well on morphologically rich languages and tasks requiring subword information

Conclusion

Word embedding has revolutionized many NLP tasks by providing rich, dense representations of words that capture semantic and syntactic information. By transforming words into numerical vectors, we enable machines to process and understand language in ways that more closely resemble human comprehension.

Each embedding technique (Word2Vec, GloVe, and fastText) has its strengths and is suited for different types of tasks or languages. As you work on NLP projects, experimenting with different embedding techniques can often lead to significant improvements in model performance.

The field of word embedding continues to evolve, with more recent developments including contextualized word embeddings (like ELMo and BERT) that generate different word vectors based on the surrounding context. These advancements promise even more sophisticated language understanding capabilities for AI systems in the future.
