Build "For you" recommendations using AI on Fastly!

Andrew Betts - Aug 7 - Dev Community

Forget the hype; where is AI delivering real value? Let's use edge computing to harness the power of AI and make smarter user experiences that are also fast, safe and reliable.

Recommendations are everywhere, and everyone knows that making web experiences more personalized makes them more engaging and successful. My Amazon homepage knows that I like home furnishings, kitchenware and right now, summer clothing:

Amazon homepage

Today, most platforms make you choose between being fast and being personalized. At Fastly, we think you (and your users) deserve to have both. If every page your web server generates is suitable for only one end user, you can't benefit from caching it, and caching is what edge networks like Fastly do well.

So how can you benefit from edge caching, and yet make content personalized? We've written a lot before about how to break up complex client requests into multiple smaller, cacheable backend requests, and you'll find tutorials, code examples and demos in the personalization topic on our developer hub.

But what if you want to go further and generate the personalization data at the edge? The "edge" (the Fastly servers handling your website's traffic) is the closest point to the end user that's still within your control, which makes it a great place to produce content that's specific to one user.

The "For you" use case

Product recommendations are inherently transient, specific to an individual user and likely to change frequently. But they also don't need to persist: we don't typically need to know what we've recommended to each person, only whether a particular algorithm achieves better conversion than another. Some recommendation algorithms need access to a large amount of state data, such as which users are most similar to you and their purchase or rating history, but often that data is easy to pregenerate in bulk.

Basically, generating recommendations usually doesn't create a transaction, doesn't need any locks in your data store, and makes use of input data that's either immediately available from the current user's session, or created in an offline build process.

Sounds like we can generate recommendations at the edge!

A real world example

Let's take a look at the website of the Metropolitan Museum of Art in New York:

Screenshot of metmuseum.org

Each of the 500,000 or so objects in the Met's collection has a page with a picture and information about it. It also has this list of related objects:

Screenshot showing related items

This seems to use a fairly straightforward system of faceting to generate these relationships, showing me other artworks by the same artist, or other objects in the same wing of the museum, or which are also made of paper or originate in the same time period.

The nice thing about this system (from a developer perspective!) is that since it's only based on the one input object, it can be pre-generated into the page.
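As a rough illustration of that kind of faceting (not the Met's actual implementation, which we can only guess at from the page), here is a minimal Rust sketch that treats two objects as related when they share an artist, a department or a medium. The `ArtObject` struct and its fields are hypothetical, loosely modelled on columns in the Met's CSV. Because the result depends only on the one input object, it can be computed once when the page is built.

```rust
/// Hypothetical facets for one collection object (field names are assumptions,
/// loosely based on columns in the Met's open access CSV).
#[derive(Debug, Clone)]
pub struct ArtObject {
    pub id: u32,
    pub artist: Option<String>,
    pub department: String,
    pub medium: String,
}

/// IDs of objects sharing at least one facet with `target`, excluding `target`.
/// Only `target` is an input, so this can be pre-generated into the page.
pub fn related_by_facets(target: &ArtObject, all: &[ArtObject]) -> Vec<u32> {
    all.iter()
        .filter(|o| o.id != target.id)
        .filter(|o| {
            // An unknown artist never counts as a match.
            let same_artist = target.artist.is_some() && o.artist == target.artist;
            same_artist || o.department == target.department || o.medium == target.medium
        })
        .map(|o| o.id)
        .collect()
}
```

A real faceting system would also rank the matches and cap how many are shown; this sketch just collects every object that shares a facet.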

What if we want to augment this with a selection of recommendations that are based on the end user's personal browsing history as they navigate around the Met's website, not just based on this one object?

Adding personalized recommendations

There are lots of ways we could do this, but I wanted to try using a language model, since AI is happening right now, and it's really different from the way the Met's existing related-artworks mechanism seems to work. Here's the plan:

  1. Download the Met's open access collection dataset.
  2. Run it through a language model to create vector embeddings – lists of numbers suitable for machine learning tasks.
  3. Build a performant similarity search engine for the resulting half a million vectors (representing the Met’s artworks) and load it into Fastly's KV Store so we can use it from Fastly Compute.

Once we've done all that, we should be able to, as you browse the Met's website:

  1. Track the artworks you visit in a cookie.
  2. Look up the vectors corresponding to those artworks.
  3. Calculate an average vector representing your browsing interests.
  4. Plug that into our similarity search engine to find the most similar artworks.
  5. Load details about those artworks from the Met's Object API and augment the page with personalized recommendations.
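Steps 1 and 3 are simple enough to sketch in Rust. The cookie format here (a comma-separated list of object IDs, most recent last, capped at `max_len` entries) is an assumption for illustration; the article doesn't specify it. Capping the list keeps the cookie small, and the "average vector" is a plain element-wise mean of the looked-up vectors.

```rust
/// Hypothetical cookie value: recently viewed object IDs, comma-separated,
/// most recent last (e.g. "1,2"). Records `id` as the latest visit,
/// dropping any earlier copy of it, and keeps at most `max_len` IDs.
pub fn update_history(cookie_value: &str, id: u32, max_len: usize) -> String {
    let mut ids: Vec<u32> = cookie_value
        .split(',')
        .filter_map(|s| s.trim().parse::<u32>().ok()) // ignore blanks and garbage
        .filter(|&x| x != id)
        .collect();
    ids.push(id);
    if ids.len() > max_len {
        let excess = ids.len() - max_len;
        ids.drain(..excess); // forget the oldest visits first
    }
    ids.iter().map(|x| x.to_string()).collect::<Vec<_>>().join(",")
}

/// Element-wise mean of the vectors for the visited objects.
/// Returns None if there are no vectors or their lengths differ.
pub fn mean_vector(vectors: &[Vec<f32>]) -> Option<Vec<f32>> {
    let dim = vectors.first()?.len();
    let mut sum = vec![0.0f32; dim];
    for v in vectors {
        if v.len() != dim {
            return None;
        }
        for (s, x) in sum.iter_mut().zip(v) {
            *s += *x;
        }
    }
    let n = vectors.len() as f32;
    Some(sum.into_iter().map(|s| s / n).collect())
}
```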

Et voilà, personalized recommendations:

Screenshot of personalized recommendations

OK, so let's break that down.

Creating the dataset

The Met's raw dataset is a CSV with lots of columns and looks like this:

Object Number,Is Highlight,Is Timeline Work,Is Public Domain,Object ID,Gallery Number,Department,AccessionYear,Object Name,Title,Culture,Period,Dynasty,Reign,Portfolio,Constituent ID,Artist Role,Artist Prefix,Artist Display Name,Artist Display Bio,Artist Suffix,Artist Alpha Sort,Artist Nationality,Artist Begin Date,Artist End Date,Artist Gender,Artist ULAN URL,Artist Wikidata URL,Object Date,Object Begin Date,Object End Date,Medium,Dimensions,Credit Line,Geography Type,City,State,County,Country,Region,Subregion,Locale,Locus,Excavation,River,Classification,Rights and Reproduction,Link Resource,Object Wikidata URL,Metadata Date,Repository,Tags,Tags AAT URL,Tags Wikidata URL
1979.486.1,False,False,False,1,,The American Wing,1979,Coin,One-dollar Liberty Head Coin,,,,,,16429,Maker," ",James Barton Longacre,"American, Delaware County, Pennsylvania 1794–1869 Philadelphia, Pennsylvania"," ","Longacre, James Barton",American,1794      ,1869      ,,http://vocab.getty.edu/page/ulan/500011409,https://www.wikidata.org/wiki/Q3806459,1853,1853,1853,Gold,Dimensions unavailable,"Gift of Heinz L. Stoppelmann, 1979",,,,,,,,,,,,,,http://www.metmuseum.org/art/collection/search/1,,,"Metropolitan Museum of Art, New York, NY",,,
1980.264.5,False,False,False,2,,The American Wing,1980,Coin,Ten-dollar Liberty Head Coin,,,,,,107,Maker," ",Christian Gobrecht,1785–1844," ","Gobrecht, Christian",American,1785      ,1844      ,,http://vocab.getty.edu/page/ulan/500077295,https://www.wikidata.org/wiki/Q5109648,1901,1901,1901,Gold,Dimensions unavailable,"Gift of Heinz L. Stoppelmann, 1980",,,,,,,,,,,,,,http://www.metmuseum.org/art/collection/search/2,,,"Metropolitan Museum of Art, New York, NY",,,

It's simple enough to transform that into just two columns, an ID and a description:

id,description
1,"One-dollar Liberty Head Coin; Type: Coin; Artist: James Barton Longacre; Medium: Gold; Date: 1853; Credit: Gift of Heinz L. Stoppelmann, 1979"
2,"Ten-dollar Liberty Head Coin; Type: Coin; Artist: Christian Gobrecht; Medium: Gold; Date: 1901; Credit: Gift of Heinz L. Stoppelmann, 1980"
3,"Two-and-a-Half Dollar Coin; Type: Coin; Medium: Gold; Date: 1927; Credit: Gift of C. Ruxton Love Jr., 1967"
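A minimal sketch of that transformation in Rust, assuming the CSV has already been parsed into fields. The column-to-label mapping (`Object Name` as "Type", `Artist Display Name` as "Artist", `Object Date` as "Date", `Credit Line` as "Credit") is inferred from the sample rows rather than taken from the original pipeline. Empty fields are skipped, which is why object 3 has no "Artist" part.

```rust
/// The handful of CSV columns used in the description (already parsed).
pub struct MetRow<'a> {
    pub title: &'a str,
    pub object_name: &'a str,
    pub artist_display_name: &'a str,
    pub medium: &'a str,
    pub object_date: &'a str,
    pub credit_line: &'a str,
}

/// Builds "Title; Type: ...; Artist: ...; Medium: ...; Date: ...; Credit: ...",
/// leaving out any labelled part whose value is empty.
pub fn describe(row: &MetRow) -> String {
    let mut parts = vec![row.title.trim().to_string()];
    let labelled = [
        ("Type", row.object_name),
        ("Artist", row.artist_display_name),
        ("Medium", row.medium),
        ("Date", row.object_date),
        ("Credit", row.credit_line),
    ];
    for &(label, value) in labelled.iter() {
        let value = value.trim(); // the raw CSV pads some fields with spaces
        if !value.is_empty() {
            parts.push(format!("{}: {}", label, value));
        }
    }
    parts.join("; ")
}
```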

Now we can use the transformers package from Hugging Face's AI toolset to generate embeddings for each of these descriptions. We used the sentence-transformers/all-MiniLM-L12-v2 model, then used principal component analysis (PCA) to reduce the resulting vectors to 5 dimensions. That gives you something like:

[
  {
    "id": 1,
    "vector": [ -0.005544120445847511, -0.030924081802368164, 0.008597176522016525, 0.20186401903629303, 0.0578165128827095 ]
  },
  {
    "id": 2,
    "vector": [ -0.005544120445847511, -0.030924081802368164, 0.008597176522016525, 0.20186401903629303, 0.0578165128827095 ]
  },
  ...
]
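The embedding model itself runs offline, but the PCA step is easy to show. Once PCA has been fitted on the whole collection, reducing an embedding to 5 dimensions is just centering it on the fitted mean and taking a dot product with each principal component. The sketch below assumes `mean` and `components` come from that offline fit; it is the standard (unwhitened) PCA projection, not code from the original pipeline.

```rust
/// Projects a full-size embedding `x` onto precomputed principal components:
///   y[j] = sum over i of (x[i] - mean[i]) * components[j][i]
/// `mean` and `components` are assumed to come from an offline PCA fit;
/// with 5 component rows, the output has 5 dimensions.
pub fn pca_project(x: &[f32], mean: &[f32], components: &[Vec<f32>]) -> Vec<f32> {
    assert_eq!(x.len(), mean.len(), "embedding and mean must have the same length");
    components
        .iter()
        .map(|c| {
            assert_eq!(c.len(), x.len(), "each component must match the embedding length");
            x.iter()
                .zip(mean)
                .zip(c)
                .map(|((xi, mi), ci)| (xi - mi) * ci)
                .sum::<f32>()
        })
        .collect()
}
```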

We have half a million of these, so it's not possible to store this entire dataset within the edge app's memory. And we want to do a custom type of similarity search over this data, which is something a traditional key-value store doesn't offer. Since we’re building a real-time experience, we also really want to avoid having to search half a million vectors at a time.

So, let's partition the data. We can use KMeans clustering to group vectors that are similar to each other. We sliced the data into 500 clusters of varying sizes, and calculated a center point called a “centroid vector” for each of those clusters. If you plotted this vector space in two dimensions and zoomed in, it might look a bit like this:

Clustering illustration

The red crosses are the mathematical center points of each cluster of vectors, called centroids. They can work like wayfinders for our half-million-vector space. For instance, if we want to find the 10 most similar vectors to a given vector A, we can first look for the nearest centroid (out of 500), then conduct our search only within its corresponding cluster: a much more manageable area!
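Here is a minimal Rust sketch of that two-stage lookup. It picks the nearest centroid by squared Euclidean distance, then ranks the vectors in that one cluster. For simplicity the second stage is a brute-force scan standing in for the HNSW graph search the app actually uses, and the `(id, vector)` cluster layout is an assumption. The search is approximate: a true nearest neighbor sitting just across a cluster boundary will be missed, which is the price of scanning a few thousand vectors instead of half a million.

```rust
/// Squared Euclidean distance; fine for ranking, so the square root is skipped.
fn dist2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Index of the centroid closest to `query` (None if there are no centroids).
pub fn nearest_centroid(query: &[f32], centroids: &[Vec<f32>]) -> Option<usize> {
    centroids
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| dist2(query, a).total_cmp(&dist2(query, b)))
        .map(|(i, _)| i)
}

/// Brute-force top-k within one cluster of (object id, vector) pairs,
/// closest first. Stands in for an HNSW search over the same cluster.
pub fn top_k_in_cluster(query: &[f32], cluster: &[(u32, Vec<f32>)], k: usize) -> Vec<u32> {
    let mut scored: Vec<(f32, u32)> = cluster
        .iter()
        .map(|(id, v)| (dist2(query, v), *id))
        .collect();
    scored.sort_by(|a, b| a.0.total_cmp(&b.0));
    scored.into_iter().take(k).map(|(_, id)| id).collect()
}
```

In the edge app, the centroid list is small enough to ship with the code, so only the one chosen cluster needs to be fetched per request.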

Now we have 500 small datasets and an index that maps centroid points to the relevant dataset. Next, to enable real-time performance, we want to precompile search graphs so that we don't need to initialize and construct them at runtime, and can use as little CPU time as possible. A really fast nearest-neighbor algorithm is Hierarchical Navigable Small Worlds (HNSW), and it has a pure Rust implementation, which suits us because Rust is the language we're using for the recommender part of our edge app. So we wrote a small standalone Rust app to construct the HNSW graph structs for each dataset, and then used bincode to serialize each instantiated struct into a binary blob.
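This sketch doesn't reproduce the HNSW graph or bincode's encoding (both need their crates), but the general idea of building a structure offline and shipping it as a byte blob can be shown with the Rust standard library alone. The layout below (a dimension and count header, then each ID followed by its `f32` components, all little-endian) is a hypothetical format chosen for illustration; it stores raw vectors rather than a prebuilt graph and is not bincode's wire format.

```rust
use std::convert::TryInto;

/// Hypothetical blob layout, all little-endian:
/// [dim: u32][count: u32] then `count` x ([id: u32][dim x f32]).
/// A std-only stand-in: NOT bincode's format, and raw vectors, not an HNSW graph.
pub fn encode_cluster(dim: usize, entries: &[(u32, Vec<f32>)]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + entries.len() * (4 + 4 * dim));
    out.extend_from_slice(&(dim as u32).to_le_bytes());
    out.extend_from_slice(&(entries.len() as u32).to_le_bytes());
    for (id, v) in entries {
        assert_eq!(v.len(), dim, "every vector must have `dim` components");
        out.extend_from_slice(&id.to_le_bytes());
        for x in v {
            out.extend_from_slice(&x.to_le_bytes());
        }
    }
    out
}

/// Decodes a blob written by `encode_cluster`; None if truncated or padded.
pub fn decode_cluster(blob: &[u8]) -> Option<Vec<(u32, Vec<f32>)>> {
    // Reads the little-endian u32 at byte offset `at`, if the blob is long enough.
    let read_u32 = |at: usize| -> Option<u32> {
        let bytes: [u8; 4] = blob.get(at..at + 4)?.try_into().ok()?;
        Some(u32::from_le_bytes(bytes))
    };
    let dim = read_u32(0)? as usize;
    let count = read_u32(4)? as usize;
    let mut pos = 8;
    let mut entries = Vec::new();
    for _ in 0..count {
        let id = read_u32(pos)?;
        pos += 4;
        let mut v = Vec::new();
        for _ in 0..dim {
            v.push(f32::from_bits(read_u32(pos)?)); // f32 bits were stored as an LE u32
            pos += 4;
        }
        entries.push((id, v));
    }
    if pos == blob.len() { Some(entries) } else { None }
}
```

Decoding a flat blob like this is cheap, which is the point of doing the expensive construction offline rather than on every request.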

Those binary blobs can then be loaded into a KV Store, keyed by cluster number, and the centroid index itself (500 centroids of 5 numbers each) is small enough to be included directly in our edge app.

This architecture lets us load parts of the search index into memory on demand. And since we’ll never have to search more than a few thousand vectors at a time, our searches will always be cheap and fast.

Building the edge app

The application that we run at the edge needs to handle several types of requests:

  • HTML pages: We fetch these from metmuseum.org and transform the response to add extra front-end <script> and <style> tags, so we can inject a bit of our own front-end processing and content.
  • The Fastly script and style resources referenced by those extra tags, which we can serve directly out of the edge app's binary.
  • The recommender endpoint, which generates and returns the recommendations.
  • All other (non-HTML) requests: Images, and the Met's own scripts and stylesheets, which we proxy directly from their domain without alteration.
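A minimal Rust sketch of how those request types might be told apart, plus the HTML change. The `/__recs` and `/__edge-assets/` paths are hypothetical (the article doesn't name them), and deciding whether to transform a proxied response by its `Content-Type` is one reasonable choice, not necessarily what the demo does. The injection is a naive splice before `</head>`; a production app would more likely use a streaming HTML rewriter.

```rust
/// Where a request should go. HTML pages and all other non-app requests
/// both go to the origin (metmuseum.org); only HTML responses get modified.
#[derive(Debug, PartialEq)]
pub enum Route {
    Recommender, // compute and return recommendations at the edge
    EdgeAsset,   // our extra script/style, served from bytes in the app binary
    Origin,      // proxy to metmuseum.org, transforming only HTML responses
}

/// Hypothetical path scheme; the real demo's paths aren't given in the article.
pub fn classify(path: &str) -> Route {
    if path == "/__recs" {
        Route::Recommender
    } else if path.starts_with("/__edge-assets/") {
        Route::EdgeAsset
    } else {
        Route::Origin
    }
}

/// True if an origin response's Content-Type says it is an HTML page.
pub fn is_html(content_type: &str) -> bool {
    content_type.trim_start().to_ascii_lowercase().starts_with("text/html")
}

/// Splices `tags` in just before the first (lowercase) `</head>`;
/// returns the page unchanged if there isn't one.
pub fn inject_before_head_close(html: &str, tags: &str) -> String {
    match html.find("</head>") {
        Some(pos) => format!("{}{}{}", &html[..pos], tags, &html[pos..]),
        None => html.to_string(),
    }
}
```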

We initially built this app in JavaScript, but ended up porting the recommender part to Rust because we liked the HNSW implementation in instant-distance.

The client side JavaScript does a few interesting things:

  1. Use IntersectionObserver to trigger an event when the user scrolls down the page to the related objects section. This is a super efficient API that's much better than older methods like listening for onscroll events.
  2. Make a fetch to our special recommendations API endpoint (which we handle at the edge, returning object information).
  3. Compose some HTML using a template built into a client-side function
  4. Append that HTML to the page and move the intersection observer to the new element so as you scroll through the recommendations, we keep loading more.

This way, we can deliver the main HTML payload without invoking our recommendation algorithm, but the recommendations are delivered fast enough that we can load them as you scroll and they'll almost certainly be there by the time you get to them.

I like doing things this way because getting that first above-the-fold view to the user as fast as possible is absolutely paramount. Anything that you can't see unless you scroll can be loaded later, especially if it's a complex piece of personalized content: there's no point generating it if the user isn't planning to scroll.

Closing thoughts

So now you have the best of both worlds: the ability to serve highly personalized content, almost never requiring any blocking fetches to origin, and an optimized HTML payload that renders incredibly fast, allowing your application to enjoy effectively limitless scalability and near-perfect resilience.

It's not a perfect solution. It'd be great if Fastly offered more higher-level features to expose edge data via query mechanisms other than a simple key lookup (let us know if that would help you!), and this specific mechanism has obvious flaws: if I have separate interests in two or more very different things (say, 19th-century oil paintings and ancient Roman amphorae), I would get recommendations at the theoretical semantic "middle point" between those, which is not a very useful result.
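The "middle point" problem is easy to see with toy numbers. In the Rust sketch below, two hypothetical 2-D vectors stand for the two interests and a third item sits between them (real embeddings have more dimensions, but the geometry is the same). Averaging the two interests gives a query that matches the in-between item almost exactly (cosine similarity about 1.0), while each of the user's real interests scores only about 0.71.

```rust
/// Cosine similarity: 1.0 means the vectors point the same way.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm(a) * norm(b))
}

/// Item names sorted from most to least similar to `query`.
pub fn rank<'a>(query: &[f32], items: &[(&'a str, Vec<f32>)]) -> Vec<&'a str> {
    let mut scored: Vec<(f32, &'a str)> = items
        .iter()
        .map(|(name, v)| (cosine(query, v), *name))
        .collect();
    scored.sort_by(|a, b| b.0.total_cmp(&a.0)); // highest similarity first
    scored.into_iter().map(|(_, name)| name).collect()
}
```

With `paintings = [1, 0]`, `amphorae = [0, 1]` and their mean `[0.5, 0.5]` as the query, `rank` puts a hypothetical in-between item at `[0.7, 0.7]` ahead of both real interests.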

Still, hopefully this demonstrates the principle that figuring out how to do work at the edge often results in outsized benefits in terms of scalability, performance and resilience.

Let us know what you build on our community thread!
