Although I spend most of my time building broad, cloud-native strategies and systems, I must admit that some of my favorite work in data is quite niche: recommendation systems. Over the past decade I've had the opportunity to build quite a few of them. Several were the expected e-commerce and media use cases, although I also got to build for social products and even internal research systems.
These systems are rewarding in two ways. First, with sufficient data and good algorithms they almost always yield good results: they increase usage and revenue and, most importantly, help users find what they want. Second, they are a mix of art and science. Yes, the algorithms are there, and they require selection and the technical work of training and tuning, but there is also a great deal of creativity in mapping these algorithms to the user experience, combining them in interesting ways, and even planning for a bit of “fun” and surprise.
In the past, building these systems was pretty heavy on the ML engineering side. You would primarily be leveraging OSS algorithms and be forced to build your own frameworks for training, serving, and feedback pipelines. Not that this is necessarily a horrible slog, at least for me, as building these sorts of things is fun and rewarding in its own way. However, I personally always wanted to get to the fun and creative parts.
And Then Amazon Personalize
Amazon Personalize was introduced at re:Invent 2018 and went GA in the summer of 2019. It started out targeting pretty basic recommendation use cases, but as it stands now in 2023 I can safely say you can build a comprehensive recommender system completely within the product. This ranges from prepackaged algorithms through to serving and event collection infrastructure. That gives you the opportunity to skip ahead to the really fun stuff!
Deployment methodology
Other than trivial explorations, where I might use the console, I am always infrastructure-as-code first. Not only does this provide a repeatable way of building and tearing down infrastructure, it’s also a great way to learn about the system from the ground up. However, with Amazon Personalize you will find that IAC coverage exists only for the most foundational components, namely the Dataset Group (the topmost project container), Datasets, and Solutions (an untrained model configuration).
Rest assured, those smart folks at AWS didn’t forget the other components or backlog them to get an MVP out the door. Most of the downstream components, particularly solution versions and campaigns, are meant to be dynamic and programmatically managed. The one potential miss is Event Trackers, which are a foundational, one-time setup and will hopefully make it into CloudFormation someday soon.
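To make "dynamic and programmatically managed" a bit more concrete, here is a minimal sketch of training a solution version and waiting for it to become usable. It assumes a boto3 Personalize client is passed in; the helper names (`wait_until_active`, `train_solution_version`), the timeout, and the polling interval are my own illustrative choices, not anything prescribed by the service.

```python
import time


def wait_until_active(describe, max_wait_s=3 * 3600, poll_s=60, sleep=time.sleep):
    """Call `describe()` (which returns a status string) until it reports ACTIVE.

    Raises RuntimeError on a FAILED status and TimeoutError once max_wait_s
    seconds have been spent sleeping. Both limits are arbitrary defaults.
    """
    waited = 0
    while True:
        status = describe()
        if status == "ACTIVE":
            return status
        if "FAILED" in status:
            raise RuntimeError(f"Personalize resource ended in status {status}")
        if waited >= max_wait_s:
            raise TimeoutError(f"Still {status} after {waited}s")
        sleep(poll_s)
        waited += poll_s


def train_solution_version(personalize, solution_arn, training_mode="FULL", **wait_kwargs):
    """Start training a new solution version and block until it is ACTIVE."""
    arn = personalize.create_solution_version(
        solutionArn=solution_arn, trainingMode=training_mode
    )["solutionVersionArn"]

    def current_status():
        response = personalize.describe_solution_version(solutionVersionArn=arn)
        return response["solutionVersion"]["status"]

    wait_until_active(current_status, **wait_kwargs)
    return arn


# Usage (hypothetical ARN; requires boto3 and AWS credentials):
#   import boto3
#   personalize = boto3.client("personalize")
#   version_arn = train_solution_version(personalize, my_solution_arn)
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to exercise with a stub before pointing it at a real account.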
In an ideal fully productionalized system the flow would look something like the diagram below.
For you Step Functions fans out there, this is an absolutely perfect use case. A state machine and several Lambdas to start and poll the various Personalize API interactions would do the job nicely. And getting back to the IAC conversation: you could absolutely IAC both the Lambdas and the state machine!
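As a rough sketch of what those Lambdas might look like, the two handlers below split the work into "start" and "check" steps so the waiting can live in the state machine rather than in a sleeping Lambda. The handler names, the event keys (`solutionArn`, `solutionVersionArn`, `status`), and the optional `personalize` argument are assumptions made for illustration.

```python
def _client():
    # Imported lazily so the handlers can be exercised with a stub client.
    import boto3
    return boto3.client("personalize")


def start_training(event, context=None, personalize=None):
    """Task state: kick off training and pass the new ARN to the next state."""
    personalize = personalize or _client()
    response = personalize.create_solution_version(solutionArn=event["solutionArn"])
    return {**event, "solutionVersionArn": response["solutionVersionArn"]}


def check_training(event, context=None, personalize=None):
    """Task state: report the current status; it does not wait or loop itself."""
    personalize = personalize or _client()
    response = personalize.describe_solution_version(
        solutionVersionArn=event["solutionVersionArn"]
    )
    return {**event, "status": response["solutionVersion"]["status"]}
```

In the state machine, a Choice state could then branch on `$.status`: move on when it is `ACTIVE`, fail when it is `CREATE FAILED`, and otherwise go through a Wait state back to `check_training`. The same start/check pair would apply to campaign creation and updates.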
But what if you just want to keep it simple and get things going quickly, or your organization is not ready to build complex Step Functions infrastructure?
Management Notebook Approach
As an engineer, I have a love-hate relationship with notebooks. They are indeed great for prototyping and exploring data. However, notebooks are permissive of bad programming habits, and in most cases they end up as run-on-sentence scripts. But they can be very helpful when used as “management scripts”, and in that role they feel much less yucky.
In one of my latest Personalize projects I used CloudFormation for dataset groups, datasets, and schemas. I then created a management notebook for first-time creation of all remaining infrastructure components, and a re-training notebook for, you guessed it, re-training. I’ve shared the important bits in this repo. Although the repo is meant to illustrate the approach, you could certainly customize it and use it for your own solution.
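To give a feel for what a first-time creation cell could contain, here is a sketch of two "get or create" helpers, written so the notebook can be re-run without creating duplicates. This is not the code from the repo; the function names and the `min_tps` default are assumptions, and only the first page of each list call is inspected.

```python
def ensure_event_tracker(personalize, dataset_group_arn, name):
    """Return the dataset group's event tracker ARN, creating one if none exists.

    As I understand the Personalize docs, a dataset group holds a single event
    tracker, so any existing tracker is reused regardless of its name.
    """
    existing = personalize.list_event_trackers(datasetGroupArn=dataset_group_arn)
    for tracker in existing.get("eventTrackers", []):
        return tracker["eventTrackerArn"]
    created = personalize.create_event_tracker(
        name=name, datasetGroupArn=dataset_group_arn
    )
    return created["eventTrackerArn"]


def ensure_campaign(personalize, solution_arn, solution_version_arn, name, min_tps=1):
    """Return the ARN of the named campaign for this solution, creating it if absent."""
    existing = personalize.list_campaigns(solutionArn=solution_arn)
    for campaign in existing.get("campaigns", []):
        if campaign["name"] == name:
            return campaign["campaignArn"]
    created = personalize.create_campaign(
        name=name,
        solutionVersionArn=solution_version_arn,
        minProvisionedTPS=min_tps,
    )
    return created["campaignArn"]
```

Both helpers return as soon as the create call is accepted; the resources still take time to become `ACTIVE`, so a later cell would need to poll before using them.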
Although the notebooks are runnable locally, I host both notebooks in Glue, and have the retraining notebook cron’d to run every hour. And if you wanted to achieve IAC nirvana with this pragmatic solution, you could absolutely IAC the Glue notebooks.
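The last step of a re-training run is pointing the live campaign at the freshly trained solution version. A minimal sketch of that step follows, assuming the new version is already `ACTIVE` and a boto3 Personalize client is passed in; the function names are hypothetical and this is not lifted from the repo.

```python
def repoint_campaign(personalize, campaign_arn, new_solution_version_arn):
    """Point the campaign at a new solution version.

    Returns True if an update was started, or False if the campaign already
    serves that version (so an hourly re-run with nothing new is a no-op).
    """
    campaign = personalize.describe_campaign(campaignArn=campaign_arn)["campaign"]
    if campaign.get("solutionVersionArn") == new_solution_version_arn:
        return False
    personalize.update_campaign(
        campaignArn=campaign_arn, solutionVersionArn=new_solution_version_arn
    )
    return True


def campaign_update_status(personalize, campaign_arn):
    """Return the status of the latest campaign update, else the campaign's own status."""
    campaign = personalize.describe_campaign(campaignArn=campaign_arn)["campaign"]
    latest = campaign.get("latestCampaignUpdate")
    return latest["status"] if latest else campaign["status"]
```

Checking the current version before calling `update_campaign` matters more on a tight schedule, since an hourly job will often find that nothing has changed.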
So in summary...
Personalization is awesome. Unless you have a really custom or unique recommendation use case, there is little reason to build a custom recommender. Personalize does require a bit of work to create a sustainable infrastructure deployment, so definitely consider a pragmatic mix of IAC and management notebooks.