Amazon SageMaker: summing up 6 months of customer meetings

Julien Simon - Jul 31 '18 - Dev Community

Amazon SageMaker was launched at re:Invent 2017 about 6 months ago. Since then, I’ve discussed with a lot of AWS customers how this new Machine Learning service could help them solve long-lasting pain points, freeing up time and resources to focus on the actual high-value Machine Learning tasks.

In this post, I’ll try to summarize these discussions, hoping to help more people get started in the right direction.

Oh great. Air France is on strike again. Better hurry up!

Pain point #1: “we don’t know where to start”

A lot of companies simply don’t know how to get started. I suppose you could write books on this topic, but here is some simple advice :)

Machine Learning on AWS is available in multiple layers. Depending on your skill level and your resources, one of them will make more sense than the others, and I would recommend that you explore that one first.

  • For general-purpose problems, application services are the simplest option. Whether it is image analysis, text to speech, natural language processing and so on, you don’t need to know the first thing about Machine Learning to use them. No dataset, no training: just pass your data to an API and get the job done.
  • If you need more control over the dataset and/or the algorithm, Amazon SageMaker is the next logical step. For example, you may want to build a cancer detection model trained on your dataset of medical images, or a machine translation model specialized in legal jargon: two tasks that are too specialized for Amazon Rekognition and Amazon Translate to handle properly.
  • If you need full control over everything, including infrastructure, then you should look into using EC2 instances with the Deep Learning AMI, a pre-built server image that includes popular tools such as TensorFlow and PyTorch, as well as NVIDIA libraries and drivers for GPU training.
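
To make the first layer concrete, here is a minimal sketch of what "just pass your data to an API" can look like from Python. It assumes the boto3 Rekognition client and its `detect_labels` operation; the helper name `top_labels`, the file name and the thresholds are hypothetical choices for illustration only. The client is passed in as an argument so the helper can be tried with a stand-in object, without AWS credentials.

```python
def top_labels(rekognition_client, image_bytes, max_labels=5, min_confidence=80.0):
    """Return (label, confidence) pairs for an image, most confident first.

    `rekognition_client` is assumed to be a boto3 Rekognition client, or any
    object exposing a compatible `detect_labels` method.
    """
    response = rekognition_client.detect_labels(
        Image={"Bytes": image_bytes},  # raw image bytes: no dataset, no training
        MaxLabels=max_labels,
        MinConfidence=min_confidence,
    )
    labels = [(item["Name"], item["Confidence"]) for item in response["Labels"]]
    return sorted(labels, key=lambda pair: pair[1], reverse=True)


# Hypothetical usage, with boto3 installed and AWS credentials configured:
#   import boto3
#   with open("photo.jpg", "rb") as f:
#       print(top_labels(boto3.client("rekognition"), f.read()))
```

The other application services follow the same pattern: one client, one call, a JSON-like response.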

Now, what about embarking on Machine Learning projects? Here are two techniques that I see customers using successfully.

  • Start small, iterate and gradually add complexity (Gall’s Law!). Here too, the Great-12-Month-Project-That-Will-Deliver-Great-Results never really works, does it?
  • Build lots of small churches, not a cathedral. Think about the 100 small inefficiencies that drag your company down: manual processes, hard-coded “business rules” in your apps, etc. Prioritize the list and start building simple Machine Learning models to fix each one of them. This seems to make a positive difference faster and with less risk than the Great-Company-Wide-AI-Project-That-Will-Solve-All-Our-Problems. In my experience, there’s only one thing certain with the Big Bang approach: it *does* end with a bang.

Pain point #2: “we can’t (or don’t want to) write Machine Learning code”

This sounds like a weird one, but I hear it quite often: some companies have domain-specific problems that can’t be solved by the application services discussed above, but they don’t have the Machine Learning skills required to write everything from scratch, or they don’t want to spend the time.

Of course, there are plenty of good Machine Learning and Deep Learning libraries out there (scikit-learn, Apache Spark MLlib, etc.), providing developers with collections of built-in algorithms. I believe that Amazon SageMaker goes one step further by bringing developers:

  • highly scalable implementations that train models faster and cheaper than other implementations.
  • state-of-the-art algorithms like image classification (CNN architecture), seq2seq (LSTM architecture), DeepAR (multi-variate time series), BlazingText (GPU implementation of Word2Vec) and more.

With these built-in algos, you don’t need to write a single line of Machine Learning code. The only code you’ll write is really “helper code”: defining where the dataset is stored, setting hyper-parameters, etc. You’ll find lots of useful examples in this notebook collection. As I often say: if you can read and write 50 lines of Python, you can do this!
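
To give an idea of how small that "helper code" is, here is a sketch that assembles a training request for a built-in algorithm. It is written against the low-level `CreateTrainingJob` API as exposed by boto3, not the SageMaker Python SDK used in the notebooks. The function name, bucket, role ARN, job name and hyper-parameter values are all hypothetical, and the training image URI is left as a placeholder because it depends on the algorithm and region.

```python
def build_training_job_request(job_name, training_image, role_arn,
                               train_s3_uri, output_s3_uri, hyperparameters,
                               instance_type="ml.p3.2xlarge", instance_count=1):
    """Assemble keyword arguments for SageMaker's CreateTrainingJob call."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": training_image,  # container of the built-in algorithm
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # Where the dataset is stored.
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        # Where the trained model should be written.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # The API expects every hyper-parameter value as a string.
        "HyperParameters": {k: str(v) for k, v in hyperparameters.items()},
    }


# All values below are placeholders for illustration.
request = build_training_job_request(
    job_name="image-classification-demo",
    training_image="<built-in algorithm image URI for your region>",
    role_arn="arn:aws:iam::123456789012:role/HypotheticalSageMakerRole",
    train_s3_uri="s3://hypothetical-bucket/train/",
    output_s3_uri="s3://hypothetical-bucket/output/",
    hyperparameters={"num_classes": 10, "epochs": 5, "learning_rate": 0.01},
)
# With credentials configured, the job could then be started with:
#   boto3.client("sagemaker").create_training_job(**request)
```

Nothing in there is Machine Learning code: it only says where the data lives, which algorithm container to run, on what hardware, and with which settings.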

Pain point #3: “building and preparing the data set is hard”

Oh absolutely. Everyone working with data knows that collecting, cleaning and transforming data is a lot of work. Sometimes getting to the data is a challenge in itself! I recently met the Data Science team of a very large enterprise customer, trying to collect data from decades of internal research. As they put it: “the data is in the PDFs”… I feel their pain.

Generally, a lot of ETL work is required prior to the actual Machine Learning process. AWS has a suite of Big Data and Analytics services that could certainly come in handy :)

In addition, Amazon SageMaker includes a Spark SDK that lets developers seamlessly integrate Spark applications with SageMaker-managed training and prediction jobs. In a nutshell, this lets you separate the ETL concern from the Machine Learning concern, use the best instance type and the optimal cluster size for each one of them, etc. If you’d like to know more, please refer to this blog post or this webinar.

Pain point #4: “Everything takes forever”

Another popular one. Machine Learning teams often live in a silo… or a bunker… or an ivory tower. They create a ticket to the IT team to get servers. They create more tickets to the Production team to deploy their models. Boundaries need to be crossed on a daily basis. People don’t talk to each other enough. Arrogant jerks everywhere make things worse. Everything will still be fine… right?

A lot of software teams have long adopted DevOps to solve this problem. As it turns out, I believe this also applies to Machine Learning (MLOps, anyone?). Of course, the point is not to turn Data Scientists and Machine Learning engineers into hardcore Ops warriors (the reverse wouldn’t work either).

What teams need is the ability to experiment, build and deploy models using a single, simple tool. I believe Amazon SageMaker lets them do that with its bespoke Python SDK. It effectively puts teams in charge of the whole process, from experimentation to production.

Again, the notebook collection will show you how to do this for all configurations: built-in algorithms, built-in environments for Deep Learning libraries and even for your own custom environment.
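
As a rough sketch of that "experimentation to production" flow, the helper below chains the two SageMaker Python SDK calls that matter most, `fit()` and `deploy()`. It assumes you already have an SDK estimator object configured for your algorithm; the helper name `train_and_deploy`, the channel name and the default instance type are hypothetical choices for illustration.

```python
def train_and_deploy(estimator, train_s3_uri,
                     instance_type="ml.m4.xlarge", instance_count=1):
    """Train on a dataset stored in S3, then deploy the model to an endpoint.

    `estimator` is assumed to be a SageMaker Python SDK estimator, or any
    object exposing compatible `fit` and `deploy` methods.
    """
    # fit() launches a managed training job and, by default, waits for it to finish.
    estimator.fit({"train": train_s3_uri})
    # deploy() creates a managed endpoint and returns a predictor bound to it.
    return estimator.deploy(
        initial_instance_count=instance_count,
        instance_type=instance_type,
    )


# Hypothetical usage, with the sagemaker package and AWS credentials set up:
#   predictor = train_and_deploy(my_estimator, "s3://hypothetical-bucket/train/")
#   result = predictor.predict(payload)
#   predictor.delete_endpoint()  # stop paying for the endpoint when done
```

The same person can run these few lines from a notebook, with no ticket to IT and no hand-off to a Production team in between.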

Pain point #5: “Infrastructure can’t keep up”

Same old, same old: there is never enough physical storage or compute to run all the projects customers want at the scale they need.

This problem rears its ugly head in many flavours:

  • Dev environments and prod environments are different, causing unexpected issues and regressions.
  • Not enough hardware is available for experimentation and it slows down innovation. I met a great company where teams use an intranet to reserve physical servers for weeks… and guess what, they “forget” to give them back and then the real fun starts :-/
  • GPUs that cost a ton of money a year ago are now outdated but they won’t be replaced for another couple of years.
  • The power consumption of GPU servers causes electrical and HVAC hotspots in datacenters.

The list goes on. Deep Learning only makes things worse, with its large image datasets and its unquenchable thirst for $10,000+ GPUs :)

EC2 instances and Amazon SageMaker solve all of it (and then some). By now, the elasticity and scalability benefits of AWS are well-known and they also apply to Machine Learning workloads.

Pain point #6: “We have to keep costs under control”

Ah, the permanent contradiction of scaling to the moon while spending the minimal amount of money :) Continuing with the previous theme of elasticity and scalability, Amazon SageMaker lets you create:

  • fully-managed development environments (aka “notebook instances”). Developers can pick from a wide range of instance types: the most inexpensive one is ml.t2.medium at $0.0464 per hour. You can save money by stopping them when you don’t need them and restarting them when you do. This will sound very familiar to EC2 users, but it might be news to Machine Learning teams just getting started with AWS :)
  • fully-managed training environments, created on-demand with the SageMaker SDK, terminated automatically when the training job is complete and billed per second. This guarantees that you will never overpay for training infrastructure, a common problem with physical infrastructure… and even with cloud-based clusters if you don’t manage them properly.
  • fully-managed prediction environments, created and terminated on-demand with the SageMaker SDK. Automatic scaling is now available as well, another important factor in not spending more than is required!
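
A quick back-of-the-envelope calculation shows why stopping notebook instances and per-second billing matter. The notebook rate is the ml.t2.medium price quoted above; the usage patterns (24/7 for 30 days versus 8 hours on 22 working days) and the $3.60 hourly training rate are hypothetical numbers picked only to make the arithmetic easy to follow.

```python
NOTEBOOK_HOURLY_RATE = 0.0464  # ml.t2.medium, USD per hour (price quoted above)


def notebook_cost(hours_per_day, days, hourly_rate=NOTEBOOK_HOURLY_RATE):
    """Cost of a notebook instance that runs `hours_per_day` for `days` days."""
    return hours_per_day * days * hourly_rate


def per_second_cost(duration_seconds, hourly_rate):
    """Cost of a job billed per second at a given hourly rate."""
    return duration_seconds * hourly_rate / 3600


always_on = notebook_cost(24, 30)    # never stopped during a 30-day month
office_hours = notebook_cost(8, 22)  # stopped outside of working hours

print(f"always on:    ${always_on:.2f}")
print(f"office hours: ${office_hours:.2f}")
print(f"saved:        {1 - office_hours / always_on:.0%}")

# Hypothetical training job: 1,000 seconds on an instance billed at $3.60 per hour.
print(f"training job: ${per_second_cost(1000, 3.60):.2f}")
```

On a single small instance the absolute amounts are tiny, but the same ratio applies to a fleet of GPU instances, where it adds up quickly.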

Closing thoughts

Amazon SageMaker is only 6 months old and we are still iterating to add the new features that customers tell us are a priority for them. Still, my contacts with AWS customers tell me that it has struck a chord with the Machine Learning community. And of course, AWS re:Invent 2018 is on the horizon: I can’t wait to see what will be announced there.

Last but not least: the service is part of the AWS free tier, so I would encourage you to give it a try, see for yourselves if it helps with your Machine Learning pain points and send me some feedback. What did you try? What did you like? More importantly, what did you miss or dislike? Please get in touch and let me know!

That’s it for today. Thanks for reading and as always, please feel free to ask me questions, here or on Twitter.
