Vector Similarity Search with Azure SQL database and OpenAI

Davide Mauri - Jun 4 '23 - - Dev Community

Vector databases are gaining quite a lot of interest lately. Using text embeddings and vector operations makes extremely easy to find similar "things". Things can be articles, photos, products…everything. As one can easily imagine, this ability is great to easily implement suggestions in applications. From providing suggestions on similar articles or other products that may be of interest, to quickly finding and grouping similar items, the applications are many.

A great article to understand how embeddings work, is the following: Introducing text and code embeddings.

Reading the mentioned articles, you can learn that "embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts."

Text Embeddings

More specifically, embeddings are vectors…hence the great interest for vector databases.

But are vector databases really needed? At the end of the day a vector is just a list of numbers and finding if two vectors represent similar object is as easy as calculating the distance between the vectors. One of the most common and useful distance metric is the cosine distance and, even better, the related cosine similarity

Cosine Distance

The real complex part is calculating the embeddings, but thanks to Azure OpenAI, everyone has an easily accessible REST service that can used to get the embeddings using pre-trained ML models. In this article we will use OpenAI to generate vectors for doing similarity search and then use Azure SQL database to store and search for similar vectors.

In this article we’ll build a sample solution to find Wikipedia articles that are related to any topic we may be interested in. As usual all the code is available in GitHub:

https://github.com/Azure-Samples/azure-sql-db-openai

The pre-calculated embeddings, both for the title and the body, of a selection of Wikipedia articles, is made available by OpenAI here:

https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip

Vectors in Azure SQL database

Vectors can be efficiently stored in Azure SQL database by columnstore indexes. There is no specific data type available to store a vector in Azure SQL database, but we can use some human ingenuity to realize that a vector is just a list of numbers. As a result, we can store a vector in a table very easily by creating a column to contain vector data. One row per vector element. We can then use a columnstore index to efficiently store and search for vectors.

Using this Wikipedia article as starting point, you can see that there are two vectors, one to store title embeddings and one to store article embeddings:

Article with embeddings

The vectors can be more efficiently stored into a table like this:

CREATE TABLE [dbo].[wikipedia_articles_embeddings_titles_vector]
(
    [article_id] [int] NOT NULL,
    [vector_value_id] [int] NOT NULL,
    [vector_value] [float] NOT NULL
) 
Enter fullscreen mode Exit fullscreen mode

On that table we can create a column store index to efficiently store and search for vectors. Then it is just a matter of calculating the distance between vectors to find the closest. Thanks to the internal optimization of the columnstore (that uses SIMD AVX-512 instructions to speed up vector operations) the distance calculation is extremely fast.

The most common distance is the cosine similarity, which can be calculated quite easily in SQL.

Calculating cosine similarity

Cosine similarity can be calculated in SQL using the following formula, given two vectors a and b:

SELECT 
    SUM(a.value * b.value) / (  
        SQRT(SUM(a.value * a.value)) * SQRT(SUM(b.value * b.value))   
    ) AS cosine_similarity
FROM
    vectors_values
Enter fullscreen mode Exit fullscreen mode

Really easy. What is now left to do is to query the Azure OpenAI REST service so that, given any text, we can get the vector representation of that text. Then we can use that vector to calculate the cosine distance against all the Wikipedia articles stored in the database and take only the closest ones which will return the article most likely connect to the topic we are interested in.

Querying OpenAI

Create an Azure OpenAI resource via the Azure portal. For this specific sample you have to deploy an Embedding model using the text-embedding-ada-002 model, the same used for the Wikipedia articles source we are using in this sample. Once that is done, you need to get the API KEY and the URL of the deployed model (read the Embeddings REST API documentation) and then you can use sp_invoke_external_rest_endpoint to call the REST API from Azure SQL database.

declare @retval int, @response nvarchar(max);
declare @payload nvarchar(max) = json_object('input': 'Isaac Asimov');

exec @retval = sp_invoke_external_rest_endpoint
    @url = 'https://<your-app-name>.openai.azure.com/openai/deployments/<deployment-id>/embeddings?api-version=2023-03-15-preview',
    @method = 'POST',
    @headers = '{"api-key":"<your api key>"}',
    @payload = @payload,
    @response = @response output;
The response is a vector of 1536 elements in JSON format. Vector values can be easily extracted using the following T-SQL code:

select [key] as value_id, [value] from openjson(@response, '$.result.data[0].embedding')
Enter fullscreen mode Exit fullscreen mode

Source code

If you are interested in trying this amazing capability by yourself, you can find the source code here:

https://github.com/Azure-Samples/azure-sql-db-openai

Conclusion

The provided sample is not optimized. For example, the square of the vectors: SUM(a.value * a.value) could be pre-calculated and stored in a table for even better efficiency and performance. The sample is purposely simple to make it easier to understand the concept. Even if the sample is also not optimized for performance, it is still quite fast. On an eight vCore Azure SQL database, the query takes only half of a second to return the fifty most similar articles. The cosine distance is calculated on 25,000 articles, for a total of 38 million vector values. Pretty cool, fast and useful!

Returing the 50 most similar articles

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .