One of the coolest things to come out of the recent wave of machine learning advances is zero-shot learning (ZSL). ZSL is the limiting case of the few-shot learning paradigm. Few-shot learning involves getting models to learn from small amounts of data. Zero-shot learning takes this one step further: it involves getting models to recognize data they’ve never seen before.
Getting models to recognize something they have never seen sounds quite tricky, but humans do it well. For example, there are many types of animals you’ve never seen before. However, you would probably recognize a fish as a fish even if you’d never seen that particular type of fish before. How do you know that it’s a fish?
You might say it’s because it lives in water, swims, has gills, and generally feels like it should be a fish. But unlike you, machine learning models don’t have this sense of “feel,” at least not in the way we do. So, how can a machine learning model tell? By applying semantic similarity.
What Is Semantic Similarity?
Semantic similarity measures how similar two things are in their meaning. There are many ways to measure the similarity of vector embeddings. In a recent article, we covered five similarity metrics: three for “dense” vectors, the kind typically produced by vector embedding models, and two for binary vectors.
In the context of zero-shot learning, we can consider semantic similarity as measured through dense vectors. Dense vectors are named as such because there are few 0s in them. The entries in a dense vector are typically real numbers. An example of a dense embedding vector could be (0.1, 0.2, -0.1, 0.112, 0.34, -0.98).
Most of the time, these numbers fall between -1 and 1. Why? Because they are the output of the second-to-last layer in a deep neural network. We use this output because it contains all of the semantic information a neural net has about its input data before making a prediction, and that’s what we want: the semantic representation.
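To make this concrete, here’s a minimal sketch in Python (using NumPy) that scores toy dense embeddings with cosine similarity, one of the dense-vector metrics covered in that earlier post. The six-dimensional vectors are made-up values for illustration; real embedding models output hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy dense embeddings like the example above (values are illustrative only).
fish  = np.array([0.1, 0.2, -0.1, 0.112, 0.34, -0.98])
shark = np.array([0.12, 0.25, -0.05, 0.1, 0.3, -0.9])
dog   = np.array([-0.8, 0.1, 0.6, -0.2, 0.05, 0.3])

print(cosine_similarity(fish, shark))  # close to 1.0 -> semantically similar
print(cosine_similarity(fish, dog))    # much lower  -> semantically different
```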
How Does Zero Shot Learning Work?
Now that we understand a little bit about semantic similarity, we can dig into zero-shot learning. The main idea behind most zero-shot learning algorithms is finding ways to associate auxiliary information with data. In the fish example above, this would be external factors like living in water, body shape, and perhaps having scales.
This information can all be encoded into numbers via vector embeddings. Models that can do ZSL can then take these quantified representations and compare and contrast new data to the data on which they were trained. You can think of it as assigning a label based on which cluster or clusters the data point is closest to.
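As a rough illustration, here’s a sketch of that idea in Python. The label embeddings below are toy values standing in for the output of a real embedding model; the unseen data point gets the label whose embedding it is most similar to.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy label embeddings standing in for a real embedding model's output.
label_embeddings = {
    "fish": np.array([0.1, 0.2, -0.1, 0.11, 0.34, -0.98]),
    "bird": np.array([-0.7, 0.4, 0.5, -0.1, 0.02, 0.3]),
}

def zero_shot_classify(embedding: np.ndarray) -> str:
    """Assign the label whose embedding is most similar to the input."""
    return max(label_embeddings,
               key=lambda label: cosine_similarity(embedding, label_embeddings[label]))

# An embedding for an animal the model never saw during training; it lands
# closest to the "fish" cluster, so it gets labeled as a fish.
unseen_animal = np.array([0.09, 0.22, -0.08, 0.1, 0.3, -0.95])
print(zero_shot_classify(unseen_animal))  # -> fish
```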
Zero-shot learning can be applied to both vision and language. The first known papers on zero-shot learning were published at the same conference in 2008, one on language and one on vision. The language paper was titled “Dataless Classification,” and the vision paper was titled “Zero-data Learning.” The term zero-shot learning first came about in 2009.
Why Is Zero Shot Learning Important?
So why is zero-shot learning so important? The basic answer is that it opens machine learning up to problems where labeled training data is scarce or nonexistent.
One of the main challenges with machine learning is that training a model typically requires a considerable amount of data. Data quantity is a huge challenge in and of itself, and data quality is another. ZSL helps address both of these problems.
With the power of semantic similarity via vector embeddings, we can use zero-shot learning to classify data without needing huge quantities of high-quality labeled data. Models built with ZSL techniques, such as CLIP, can classify images or label text they have never seen before.
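For instance, here’s a minimal sketch of zero-shot image classification with CLIP via the Hugging Face Transformers library. The checkpoint name, the candidate labels, and the local file animal.jpg are assumptions made for illustration:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# openai/clip-vit-base-patch32 is one publicly available CLIP checkpoint.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate labels the model was never explicitly trained to predict.
labels = ["a photo of a fish", "a photo of a bird", "a photo of a dog"]
image = Image.open("animal.jpg")  # hypothetical local image file

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=1)
for label, prob in zip(labels, probs[0].tolist()):
    print(f"{label}: {prob:.3f}")
```

Notice there is no training step: swapping in a different set of candidate labels is all it takes to classify against new, never-seen categories.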
Classifying new data without needing large amounts of task-specific training data lets us reduce data costs and increase access to machine learning via pre-trained models. This removes barriers for people and businesses to enter the machine learning and AI space.
What Are Some Examples of Zero Shot Classification Models?
Zero-shot classification has come a long way since 2008. One of the most popular ZSL models published recently is CLIP (Contrastive Language-Image Pre-training) by OpenAI. Other popular models in this space include:
DUET by Chen et al. (Zhejiang University)
SPOT (VAEGAN) by Shreyank N Gowda (University of Oxford)
ZSL-KG by Nihal V. Nayak, Stephen H. Bach (Brown University)
CLIP ResNet-50 (the ResNet-50 image-encoder variant of CLIP) by Radford et al. (OpenAI)
Summary of Zero Shot Learning
In this article, we got some insight into zero-shot learning. ZSL is a transformative technique that has lowered entry barriers to AI/ML for businesses and individuals. The researchers in this space and the models they have produced have truly helped democratize AI. Zero-shot learning provides the unique ability to classify images or label text that a model has never seen before.
Zero-shot learning works by using semantic similarity via vector embeddings. Models that do ZSL essentially predict classes based on how semantically similar the input is to each candidate class. Much like our brains in the fish example, they use auxiliary information to decide how input data should be classified.
From humble beginnings in vision and language in 2008, zero-shot learning has come a long way. It’s now available for image classification, such as with CLIP’s ResNet-50 variant, and even multimodal classification, such as with CLIP itself. There are many implementations of zero-shot learning techniques, and with the rise of large language models, we expect even better, more efficient techniques in the future.