Revolutionizing ML Training: How Amazon Search M5 Achieved 30% Cost Savings with AWS Trainium

Sidra Saleem - Feb 14 - Dev Community

Overview of Amazon's ML Innovations

For the past decade, Amazon has been a leader in machine learning (ML) innovation, using ML continuously to improve the customer experience. Amazon applies ML across a wide range of scenarios, including algorithm improvements and fraud detection.

Evolution from Traditional ML to Neural Networks and Deep Learning

Over time, Amazon's ML practice has evolved from traditional techniques to neural networks and deep learning. With these shifts, the company embraced hardware acceleration to explore the latest model architectures. This advancement empowered Amazon's teams to move into multi-entity, multi-locale, multitask models encompassing text, images, and video.

Introduction to the M5 Program within Amazon Search

The M5 program plays a central role within Amazon Search, leading the discovery learning strategy for the company. The program builds large-scale models that cover diverse aspects: multitask capabilities, multiple entities, and multiple modalities such as text, images, and video. These models serve as universal foundations for hundreds of teams across Amazon. The M5 program focuses primarily on serving high-quality ML models at low cost.

Utilizing AWS Accelerators

Introduction to AWS Inferentia and Trainium

Amazon Web Services (AWS) offers purpose-built accelerators, including AWS Inferentia and AWS Trainium, for machine learning workloads. Launched in 2020, Inferentia accelerates the efficient deployment of inference workloads, improving both latency and cost.

Overview of Trainium Trn1 and Trn1n Instances

Trainium is offered through two instance types, Trn1 and Trn1n, designed specifically for training large-scale models with billions of parameters. Trn1 instances are powered by NeuronCore-v2 accelerators, which provide high compute and memory capacity, while Trn1n instances provide additional network bandwidth.

Importance of Accelerators in ML Organizations

Accelerators play a very important role in ML organizations: they improve the speed and efficiency of both training and inference. With Trainium, ML teams can achieve high performance while optimizing costs.

Integration of Neuron SDK and PyTorch XLA for Accelerator Support

A robust software layer is essential to harness this acceleration power. The Neuron SDK, combined with PyTorch XLA, integrates the accelerators into existing ML workflows. An example of using PyTorch XLA with the Neuron SDK is given below:

import torch
import torch_xla.core.xla_model as xm

# Set up the XLA device
xla_device = xm.xla_device()

# Define your PyTorch model
model = MyModel(*args, **kwargs)

# Load model state from a checkpoint
model.load_state_dict(torch.load(PATH))

# Move the model to the XLA device
model.to(xla_device)

This integration is pivotal for the M5 team's exploration of Trainium to achieve cost-effective model training.

Integration of Neuron SDK and PyTorch XLA for Accelerator Support (Continued)

Example Code for Using Neuron SDK and PyTorch XLA

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.utils.serialization as xser

# Set up the XLA device
device = xm.xla_device()

# Define your PyTorch model and move it to the XLA device
model = MyModel(*args, **kwargs)
model.to(device)

# Define loss function and optimizer
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Wrap the data loader with PyTorch XLA's device loader so batches are
# preloaded onto the XLA device and data transfer overlaps with computation
train_device_loader = pl.MpDeviceLoader(train_loader, device)

# Training loop
for epoch in range(num_epochs):
    for inputs, labels in train_device_loader:
        # Forward pass
        outputs = model(inputs)
        loss = loss_fn(outputs, labels)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        xm.optimizer_step(optimizer)  # reduces gradients across devices and steps the optimizer

# Save the model checkpoint
xser.save(model.state_dict(), MODEL_PATH)

This example illustrates the integration of PyTorch XLA and the Neuron SDK for training on Trainium. PyTorch XLA's device loader takes care of distributing data to the accelerator efficiently, overlapping data loading with computation.

Model and Workload Overview

Description of the M5 Team's Training and Deployment Process

The M5 team within Amazon Search is responsible for implementing the company's discovery learning strategy. The team's training and deployment process involves building universal embedding models and distributing them across Amazon. These models are designed to serve many different needs and to provide an excellent experience to amazon.com customers.

Specifics of the Foundational Model - Text Encoder with MLP

One of the M5 team's core developments is a foundational model consisting of a text encoder followed by a multi-layer perceptron (MLP). This architecture incorporates both explicit and implicit features. The model is trained on billions of tokens and generates millions of embeddings in an offline batch setting. These embeddings serve as inputs to a customer-facing tier-1 service.
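
The exact model is internal to Amazon, but a minimal sketch of a text encoder followed by an MLP head, with hypothetical vocabulary size, dimensions, and layer counts, could look like the following:

import torch
import torch.nn as nn

class EncoderWithMLP(nn.Module):
    """Illustrative text encoder + MLP embedding model (all sizes are placeholders)."""

    def __init__(self, vocab_size=50000, d_model=512, n_heads=8, n_layers=6, emb_dim=256):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # MLP head that projects pooled encoder outputs into the embedding space
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, emb_dim),
        )

    def forward(self, token_ids):
        hidden = self.encoder(self.token_embedding(token_ids))
        pooled = hidden.mean(dim=1)  # simple mean pooling over the token dimension
        return self.mlp(pooled)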

Production Pipeline Infrastructure Using AWS Batch and Trn1.32xlarge Cluster

The M5 team's production pipeline relies on AWS Batch with fair-share queuing strategies, and model training runs on an EFA-enabled, multi-node trn1.32xlarge cluster. Functionally, the production pipeline performs incremental model training. This infrastructure is instrumental in meeting the service level agreement (SLA) and ensures an efficient deployment process without regressions.
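
For illustration, a fair-share scheduling policy and job queue of this kind could be configured with boto3 roughly as follows; the policy name, share identifiers, weights, and compute environment ARN are placeholders, not the M5 team's actual setup:

import boto3

batch = boto3.client("batch")

# Create a fair-share scheduling policy (identifiers and weights are placeholders)
policy = batch.create_scheduling_policy(
    name="m5-fair-share-policy",
    fairsharePolicy={
        "shareDecaySeconds": 3600,
        "computeReservation": 10,
        "shareDistribution": [
            {"shareIdentifier": "incremental-training", "weightFactor": 1.0},
            {"shareIdentifier": "experiments", "weightFactor": 2.0},
        ],
    },
)

# Attach the policy to a job queue backed by a trn1.32xlarge compute environment
batch.create_job_queue(
    jobQueueName="m5-trn1-queue",
    priority=1,
    schedulingPolicyArn=policy["arn"],
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "arn:aws:batch:...:compute-environment/trn1-ce"},
    ],
)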

Goals and Criteria for Transition to Trainium

Focus on Customer Satisfaction and SLAs

Throughout the transition to Trainium, the primary focus remained on customer satisfaction. Meeting the pipeline's service level agreements is essential, and the overarching goal is to improve the experience of amazon.com customers.

Critical Acceptance Criteria for Transitioning to Trainium

  • Model Quality: ML model quality strongly influences the customer experience. The M5 team established a critical criterion: the difference in model quality relative to the previous GPU-based pipeline must be less than 0.1%. Rigorous testing and validation processes are used to check that the model meets or exceeds these benchmarks (a simple parity check is sketched after this list).
  • Training Throughput: Recognizing the importance of timely model convergence, the M5 team set another acceptance criterion: model convergence must be achieved within a predefined period, such as two weeks, to align with production SLAs.
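
As an illustration of the model quality criterion, a hedged sketch of such a parity check is shown below; the metric names, values, and the comparison helper are assumptions for illustration, not the M5 team's actual validation code:

def check_quality_parity(gpu_metrics, trainium_metrics, tolerance=0.001):
    """Return True if every Trainium metric is within `tolerance` (0.1%) of the GPU baseline."""
    for name, gpu_value in gpu_metrics.items():
        trn_value = trainium_metrics[name]
        relative_diff = abs(trn_value - gpu_value) / abs(gpu_value)
        if relative_diff >= tolerance:
            print(f"{name}: relative difference {relative_diff:.4%} exceeds the tolerance")
            return False
    return True

# Example usage with hypothetical holdout-set metrics
gpu_metrics = {"recall@100": 0.8412, "ndcg@10": 0.6231}
trainium_metrics = {"recall@100": 0.8409, "ndcg@10": 0.6233}
assert check_quality_parity(gpu_metrics, trainium_metrics)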

Ensuring Model Quality and Training Throughput

The M5 team worked backward from these goals to systematically achieve the defined acceptance criteria. This involved adapting the training scripts, optimizing compilation for Trainium, and ensuring checkpoint compatibility.

Adapting Training Script for Trainium

Importance of Making the Training Script XLA Compliant

XLA is a domain-specific compiler for linear algebra operations, and making the training script XLA compliant is the first step in running it on Trainium. This adaptation enhances performance and contributes toward achieving the defined model quality goals.

Use of Distributed Data Parallel (DDP) for Increased Throughput

Distributed Data Parallel (DDP) is the technique the M5 team uses to increase training throughput. DDP enables scaling to a larger number of machines without significant code changes, and this approach aligns well with efficient, scalable training on Trainium.
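
A minimal sketch of multi-process data-parallel training with PyTorch XLA is shown below; build_model and build_loader are hypothetical helpers, and this is an illustration rather than the M5 team's actual training script:

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each spawned process drives one XLA device (a NeuronCore on Trainium)
    device = xm.xla_device()
    model = build_model().to(device)                 # hypothetical model factory
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    loader = pl.MpDeviceLoader(build_loader(index), device)  # hypothetical per-worker loader

    model.train()
    for inputs, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        # All-reduces gradients across processes before stepping (data parallelism)
        xm.optimizer_step(optimizer)

if __name__ == "__main__":
    xmp.spawn(_mp_fn)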

Technical Details and Code Changes in the Training Script

Placement of xm.mark_step()

The xm.mark_step() function plays a crucial role in the training script: it cuts off and executes the computation graph collected so far. Placing xm.mark_step() after each forward and backward pass is used to optimize training; this careful placement ensures an optimal balance between the number and the size of the computation graphs.
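
When the data loader is not wrapped with the XLA device loader, the placement could look roughly like this sketch (model, loss_fn, optimizer, device, and train_loader are assumed from the earlier examples):

import torch_xla.core.xla_model as xm

for inputs, labels in train_loader:
    inputs, labels = inputs.to(device), labels.to(device)

    outputs = model(inputs)          # operations are recorded lazily on the XLA device
    loss = loss_fn(outputs, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Cut off the recorded computation graph here and execute it on the accelerator.
    # Placing the call after the forward and backward pass keeps graph sizes balanced.
    xm.mark_step()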

Data Loader Wrapping with XLA Multiprocessing Device Loader

Another important step in adapting the training script is wrapping the data loader with the XLA multiprocessing device loader. torch_xla.distributed.parallel_loader.MpDeviceLoader is employed to load training data onto each XLA device; the wrapper provides options to preload data and to overlap data transfer with device computation. The device loader also invokes xm.mark_step(), allowing graphs to be constructed for loading data from the host to the device.

Compilation for Trainium

Introduction to AOT and JIT Compilation

Compilation is an important step in preparing a model for an accelerator. There are two main approaches: ahead-of-time (AOT) compilation and just-in-time (JIT) compilation. JIT traces, compiles, and runs graphs as they are encountered, whereas AOT traces and compiles the graphs before the run.

Using Neuron SDK for AOT Compilation

The Neuron SDK provides the neuron_parallel_compile utility for AOT compilation. During the actual training run, it is essential that no new graphs are created, so that the pre-compiled graphs are reused. AOT compilation is a critical step in the integration process.

Enforcing Constraints for Static Shapes and Fixed-Size Batches

Dynamic shapes in the training batches cause new computation graphs to be created during model training. To mitigate this, the M5 team enforces constraints on the training data, such as static tensor shapes and fixed-size batches. This approach eliminates recompilation during training and improves training time.
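
A hedged sketch of enforcing these constraints on the data side is shown below; MAX_SEQ_LEN, BATCH_SIZE, and train_dataset are placeholders rather than the M5 team's actual values:

import torch
from torch.utils.data import DataLoader

MAX_SEQ_LEN = 128    # every example is padded/truncated to the same length (placeholder)
BATCH_SIZE = 256     # placeholder batch size

def collate_fixed_shape(batch):
    """Pad or truncate every sequence to MAX_SEQ_LEN so tensor shapes never change."""
    input_ids, labels = [], []
    for ids, label in batch:
        ids = list(ids)[:MAX_SEQ_LEN]
        padded = torch.zeros(MAX_SEQ_LEN, dtype=torch.long)
        padded[: len(ids)] = torch.as_tensor(ids, dtype=torch.long)
        input_ids.append(padded)
        labels.append(label)
    return torch.stack(input_ids), torch.as_tensor(labels)

# drop_last=True guarantees every batch has exactly BATCH_SIZE examples, so the
# compiled computation graphs are reused instead of being recompiled.
train_loader = DataLoader(
    train_dataset,               # hypothetical dataset of (token_ids, label) pairs
    batch_size=BATCH_SIZE,
    shuffle=True,
    drop_last=True,
    collate_fn=collate_fixed_shape,
)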

Compiler-Level Optimization Options and Flags

The Neuron compiler provides three optimization options:

  • O1 (Optimization Level 1): Enables core optimizations on the compute graph and minimizes compilation time.
  • O2 (Optimization Level 2): Balances between compilation time and model run throughput.
  • O3 (Optimization Level 3): Provides improved model run throughput at the cost of higher compilation time.

For its use case, the M5 team chose the O1 optimization level, which demonstrated an 86% reduction in compilation time with no adverse effect on model accuracy metrics. Below is an example snippet showing the use of compilation flags:

export NEURON_CC_FLAGS="--target trn1 --auto-cast all --auto-cast-type bf16 --model-type transformer --optlevel O1"

This example sets the compilation flags, specifying the target accelerator, enabling automatic casting, defining the model type, and choosing the optimization level. The flexibility in choosing optimization levels allows teams to tailor the compilation process based on their specific use cases.

Checkpoint Compatibility

Seamless Transitioning Between GPU and Trainium

Seamless transitioning between GPU and Trainium is essential for the M5 team's workflow. It lets the team retain compatibility with both worlds and maintain flexibility when selecting the hardware for model training.

Use of PyTorch Checkpoint Loading Utility

PyTorch's checkpoint loading utilities are used to facilitate the transition. They provide a direct way of saving and loading models, making it possible to move models between GPU and Trainium without significant code changes.

Example Code for Saving and Loading Models on GPU and Trainium

Saving the Model on GPU:

import torch

# Save the model on GPU
torch.save(model.state_dict(), PATH)

Loading the Model on Trainium:

import torch
import torch_xla.core.xla_model as xm

# Set up XLA device
xla_device = xm.xla_device()

# Define the model on Trainium
model = MyModel(*args, **kwargs)

# Load the model back from the saved checkpoint
model.load_state_dict(torch.load(PATH))

# Move the model to the XLA device
model.to(xla_device)

Saving the Model on Trainium:

import torch_xla.core.xla_model as xm

# Save the model on Trainium
xm.save(model.state_dict(), PATH)

Loading the Model on GPU:

import torch

# Define the model on GPU
model = MyModel(*args, **kwargs)

# Load the model back from the checkpoint saved on Trainium
model.load_state_dict(torch.load(PATH))

# Move the model to the desired GPU device
device = torch.device("cuda")  # can be any GPU device
model.to(device)

This example code illustrates the process of saving a model on one hardware configuration (GPU) and seamlessly loading it onto a different hardware configuration (Trainium), and vice versa, using PyTorch utilities. The compatibility between GPU and Trainium ensures operational stability and flexibility for the M5 team during the transition.

Operational Stability

Importance of Operational Stability in Production Pipelines

Operational stability is a very important part of running ML workloads in production. The M5 team is responsible for ensuring a stable pipeline across model training, evaluation, and inference. Stable operation reduces downtime and issues and increases the reliability of the ML workflow.

Running Sanity Tests before Model Training

Before model training starts, the M5 team runs sanity tests to check the health of the overall system and of the individual machines. The tests involve simple tensor operations to confirm that the accelerator devices are functional. The NCCOM test suite from the Neuron SDK is also utilized, running operations such as all-gather, all-reduce, and reduce-scatter to verify proper communication and functionality.
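
A minimal sketch of a pre-training sanity check on the accelerator is shown below; it only runs a simple tensor operation on the XLA device and is illustrative, not the M5 team's actual test suite or the NCCOM tests:

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# Run a simple tensor operation on the accelerator and force execution
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
c = a @ b
xm.mark_step()  # execute the traced graph on the device

# Verify the result has the expected shape and contains only finite values
assert c.shape == (1024, 1024)
assert torch.isfinite(c.cpu()).all(), "accelerator produced non-finite values"
print("Sanity check passed on", device)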

Resilience Mechanisms for Handling Transient Issues

Knowing that transient issues are inevitable in any pipeline, regardless of the underlying accelerator, the M5 team implemented resilience mechanisms. In particular, a retry mechanism is built into the pipelines to address transient failures. This proactive approach minimizes the impact of such issues and enhances the robustness of the overall pipeline.

Achieving High Success Rates with Retry Mechanisms

Example Code for AWS Batch Automated Retries

import boto3
import time

# Define AWS Batch job parameters
job_name = "your-job-name"
job_queue = "your-job-queue"
job_definition = "your-job-definition"
retry_attempts = 3  # number of retry attempts

# Create an AWS Batch client
batch_client = boto3.client('batch')

# Define job submission parameters
job_submission_params = {
    'jobName': job_name,
    'jobQueue': job_queue,
    'jobDefinition': job_definition,
    # Additional job parameters...
}

# Submit the job to AWS Batch
response = batch_client.submit_job(**job_submission_params)

# Monitor job status and implement retry logic
for attempt in range(retry_attempts + 1):
    # Poll until the job reaches a terminal state
    while True:
        job_status = batch_client.describe_jobs(jobs=[response['jobId']])['jobs'][0]['status']
        if job_status in ('SUCCEEDED', 'FAILED'):
            break
        time.sleep(60)

    if job_status == 'SUCCEEDED':
        print("Job succeeded!")
        break

    if attempt == retry_attempts:
        print("Maximum retry attempts reached. Consider manual intervention.")
        break

    # Job failed: wait with exponential backoff, then resubmit
    print(f"Job failed (attempt {attempt + 1}). Retrying...")
    time.sleep(60 * (2 ** attempt))  # exponential backoff
    response = batch_client.submit_job(**job_submission_params)

This example code demonstrates a simple implementation of a retry mechanism for AWS Batch jobs. The script monitors the status of the submitted job and resubmits it in case of transient failures. The exponential backoff between retries helps mitigate issues caused by transient failures. Adjust the parameters and error handling based on the specific requirements and characteristics of the production pipeline.

Results and Validation

Validation of Model Accuracy on a Holdout Dataset

To confirm that the transition to Trainium was successful, the M5 team validates model accuracy on a holdout dataset that serves as a representative sample. The validation process compares accuracy metrics against the GPU baseline for consistency and against established quality benchmarks.
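
A minimal sketch of such a holdout evaluation is shown below; trainium_model, holdout_loader, and the simple accuracy metric are illustrative placeholders:

import torch
import torch_xla.core.xla_model as xm

@torch.no_grad()
def evaluate(model, holdout_loader, device):
    """Compute simple accuracy of `model` on a holdout dataset."""
    model.eval()
    correct, total = 0, 0
    for inputs, labels in holdout_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        predictions = model(inputs).argmax(dim=-1)
        correct += (predictions == labels).sum().item()
        total += labels.numel()
    return correct / total

# Compare against the GPU baseline metric computed on the same holdout set
trainium_accuracy = evaluate(trainium_model, holdout_loader, xm.xla_device())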

Conclusion

Factors to Consider When Evaluating Accelerators

Key considerations include:

  • Model Quality: The accelerator should not compromise model quality; it must produce accurate and reliable results.
  • Throughput: The ability to handle large-scale training workloads is an important factor when choosing an accelerator.
  • Cost-effectiveness: The cost implications of a specific accelerator must be assessed. The end goal is to achieve the best performance while maximizing the efficiency of the ML workflow.

Achievements and Cost Savings with Trainium

The M5 team successfully reduced the cost of training models by 30%, showcasing the efficiency and effectiveness of Trainium accelerators. The achievements include:

  • Model Quality Parity: Models trained on Trainium achieve quality on par with their GPU-trained counterparts.
  • Training Throughput: Trainium proved a capable solution for achieving model convergence within the predefined period, meeting production SLAs and improving the customer experience.
  • Cost Savings: The cost-effectiveness analysis revealed model training on Trainium to be up to 30% cheaper than on GPUs.

Encouragement to Explore Trainium and Neuron Devices for ML Workloads

In conclusion, the M5 team's success with Trainium should encourage other ML teams and organizations to explore Trainium's capabilities. By considering factors such as model quality, throughput, operational stability, and cost, optimal results can be achieved for ML workloads. The ongoing partnership and collaboration with the Annapurna Neuron team highlights the commitment to innovation and continuous improvement.
