AWS ML Associate - Study Guide

Hardik Joshi - Feb 5 - Dev Community

Hi all, below is a study guide for those preparing for the AWS ML Associate certification. I went through the AWS Skill Builder training, and below are my hand-written notes from it; writing the notes out helps me understand the concepts better.

Unfortunately, when pasting into dev.to the formatting was lost, so I used a tool to retain it, which hasn't worked as expected. If you need the raw notes, please message me and I will send them to you.

Key Questions at Preparatory Stages

  • What data do you need to reach the intended output?
  • Do you have access to that data? If so, how much and where is it?
  • Which solution can you use to bring all this data into one centralized repository?

Data Requirements

  • Representative
  • Relevant
  • Feature Rich
  • Consistent

Data Types:

  • Text
  • Tabular
  • Time Series
  • Image

Data Formats:

  • Row-based data format: CSV, Avro, RecordIO
  • Column-based file types: Parquet, ORC
  • Object notation: JSON, JSONL
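
The JSON vs. JSONL distinction can be seen with the standard library alone: JSONL stores one JSON object per line, so large datasets can be streamed and split record by record. A minimal sketch (the records are made up):

```python
import json

# JSON stores a whole dataset as one object; JSONL stores one JSON
# object per line, so large files can be read record by record.
records = [{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}]

# Serialize to JSONL: one json.dumps per record, newline-separated.
jsonl_text = "\n".join(json.dumps(r) for r in records)

# Parse back line by line, as a streaming reader would.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
```

This line-oriented layout is also what makes JSONL "splittable" for parallel processing.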

Selecting Data Storage - Key Considerations:

  • Throughput
  • Latency
  • Elasticity
  • Scalability
  • Data access patterns: Copy and Load, Sequential Streaming, Randomized Access

Storage Solutions:

  • Amazon EBS: Best for training workloads with frequent random I/O access
  • Amazon EFS: Suitable for real-time and streaming workloads requiring concurrent access
  • Amazon S3: Ideal for large datasets that don't need quick access (e.g., pretrained ML models)

Data Ingestion:

Methods:

  • Batch Ingestion
  • Real-Time Ingestion

Real-Time Use Cases with Amazon Kinesis:

  1. Data Ingestion: Kinesis Data Streams for real-time data streaming
  2. Data Processing: Use Amazon Managed Service for Apache Flink for real-time processing
  3. Real-time Inference: Stream data to an Amazon SageMaker endpoint

Tools for Data Transfer and Extraction:

  • AWS CLI
  • AWS SDK
  • S3 Transfer Acceleration
  • AWS DMS
  • AWS Lambda
  • AWS Glue
  • AWS DataSync
  • AWS Snowball

AWS Data Merging:

AWS Glue Workflow:

  1. Identify Data Sources
  2. Create an AWS Glue Crawler
  3. Generate ETL Scripts and Define Jobs
  4. Output Results

Amazon EMR Workflow:

  1. Ingest Streaming Data
  2. Distribute Across EMR Clusters
  3. Transform Data
  4. Output to Amazon S3


Data Transformation and Validation:

Data Transformation Techniques:

  • Remove unnecessary data
  • Use Columnar Storage
  • Compress Data
  • Format Splittable Data

Data Validation Checks:

  • Data Integrity
  • Accuracy
  • Completeness
  • Consistency
  • Reliability
  • Security
  • Bias Detection

Feature Engineering:

Categorical Encoding:

  • Techniques: Label Encoding, One-Hot Encoding
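
Both techniques are easy to sketch in plain Python (the color values below are made up for illustration):

```python
colors = ["red", "green", "blue", "green"]

# Label encoding: map each category to an integer index.
categories = sorted(set(colors))            # ['blue', 'green', 'red']
label = {c: i for i, c in enumerate(categories)}
label_encoded = [label[c] for c in colors]

# One-hot encoding: one binary column per category,
# avoiding the false ordering that integer labels imply.
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
```

One-hot encoding is usually preferred for nominal categories, since label encoding implies an order the data may not have.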

Numeric Feature Engineering:

  • Feature Scaling: Normalization (rescales values to the 0-1 range), Standardization (rescales to mean 0 and unit variance)
  • Binning: Group numerical features into smaller bins
  • Log Transformation
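
A minimal pure-Python sketch of these transformations (the values and bin width are made up; standardization here is the usual z-score):

```python
import statistics

values = [10.0, 20.0, 30.0, 40.0]

# Min-max normalization: rescale to the [0, 1] range.
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization (z-score): mean 0, unit variance.
mean = statistics.mean(values)
std = statistics.pstdev(values)
standardized = [(v - mean) / std for v in values]

# Binning: group values into fixed-width bins (width 10 here).
bins = [int((v - lo) // 10) for v in values]
```
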

Model Training Concepts:

  • Loss Functions: Root Mean Square Error (RMSE), Log-Likelihood Loss
  • Optimization Techniques: Gradient Descent, Stochastic Gradient Descent
  • Compute Environments: CPU Training Instances, GPU-accelerated Instances
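
The loss and optimization ideas above fit in a few lines of Python: gradient descent repeatedly steps a weight against the gradient of the mean squared error. The data, learning rate, and iteration count below are made up for illustration:

```python
# Toy gradient descent: fit y = w * x by minimizing mean squared error.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # underlying relationship: y = 2x

w = 0.0
lr = 0.05
for _ in range(200):
    # dMSE/dw = (2/n) * sum((w*x - y) * x)
    grad = 2 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad            # step against the gradient
```

Stochastic gradient descent differs only in computing the gradient on a random mini-batch per step rather than the full dataset.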

Model Deployment Steps:

  1. Version Control
  2. CI/CD Integration
  3. ML Model Numbering and Deployment
  4. Monitoring and Workflow Security

Workflow Orchestration Options:

  • SageMaker Pipelines
  • AWS Step Functions
  • AWS MWAA (Managed Workflows for Apache Airflow)
  • Third-party Tools

Hyperparameter Tuning:

Techniques:

  • Manual Selection
  • Grid Search
  • Random Search
  • Bayesian Optimization
  • SageMaker Automatic Model Tuning
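
As a rough illustration of grid search versus random search (the hyperparameter grid and the `score` function are made up; `score` stands in for a real train-and-validate run):

```python
import itertools
import random

grid = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5, 7],
}

def score(params):
    # Hypothetical stand-in for training a model and measuring validation accuracy.
    return -abs(params["learning_rate"] - 0.1) - abs(params["max_depth"] - 5)

# Grid search: evaluate every combination exhaustively.
combos = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
best = max(combos, key=score)

# Random search: sample a fixed budget of combinations instead.
random.seed(0)
samples = [{k: random.choice(v) for k, v in grid.items()} for _ in range(4)]
```

Random search caps cost at a fixed budget, which is why it scales better than grid search as the number of hyperparameters grows; Bayesian optimization (used by SageMaker Automatic Model Tuning) goes further by choosing each new sample based on previous results.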

Conclusion:

Using AWS services like SageMaker, Glue, and Kinesis, you can create scalable, efficient machine learning workflows. Leverage the right combination of data storage, processing, and model tuning to meet your business goals effectively.

Model Size and Performance Optimization

Factors Affecting Model Size:

  • Input Data and Features
  • Architecture
  • Problem Domain
  • Desired Accuracy

Model Size Reduction Techniques:

  • Compression
  • Pruning
  • Quantization
  • Knowledge Distillation
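
A minimal sketch of one of these techniques, quantization: float weights are mapped to 8-bit integers through a shared scale factor (symmetric quantization; the weight values are made up):

```python
# Toy post-training quantization: map float weights to int8 with one scale.
weights = [0.5, -1.2, 0.03, 2.0]

scale = max(abs(w) for w in weights) / 127          # symmetric int8 range
quantized = [round(w / scale) for w in weights]     # values in [-127, 127]
dequantized = [q * scale for q in quantized]        # lossy reconstruction
```

Storing int8 instead of float32 cuts weight storage roughly 4x, at the cost of the small reconstruction error visible in `dequantized`.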

Model Performance and Evaluation

Model Evaluation Techniques:

  • Hold-out data: reserve a validation split and a test split from your dataset
  • Establishing Evaluation Metrics
  • Assessing Trade-offs (performance, training time, and cost)

Common Metrics for Classification Problems:

  • Accuracy: (TP + TN) / (TP + TN + FP + FN)
  • Precision: TP / (TP + FP) – best when the cost of false positives is high
  • Recall: TP / (TP + FN) – best when the cost of false negatives is high
  • F1-Score: harmonic mean of Precision and Recall
  • AUC-ROC: Area Under the Receiver Operating Characteristic curve
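
These formulas can be checked with a few lines of Python (the confusion-matrix counts below are hypothetical):

```python
# Hypothetical confusion-matrix counts.
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
```
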

Regression Problem Metrics:

  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
  • R-Squared (R²)
  • Adjusted R-Squared
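
A quick pure-Python sketch of these metrics on made-up predictions:

```python
actual = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]
n = len(actual)

# MSE: mean of squared errors; RMSE is its square root, in the
# same units as the target.
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
rmse = mse ** 0.5

# R-squared: fraction of variance in the target explained by the model.
mean_a = sum(actual) / n
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_a) ** 2 for a in actual)
r_squared = 1 - ss_res / ss_tot
```
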

Model Deployment and Monitoring

Model Deployment Targets:

  • Amazon SageMaker Endpoints
  • Amazon EKS
  • Amazon ECS
  • AWS Lambda

Model Inference Options:

  • Real-time
  • Serverless
  • Asynchronous
  • Batch Transform

Infrastructure as Code (IaC) for ML Workflows

Benefits of IaC:

  • Code Management
  • Repeatable Provisioning
  • Testing and Experimentation

Tools:

  • AWS CloudFormation
  • AWS CDK (Cloud Development Kit)
  • Terraform
  • Pulumi

AWS CDK Components:

  • App
  • Stack
  • Construct

CDK Deployment Workflow:

  1. cdk init <app>
  2. cdk bootstrap
  3. cdk synth
  4. cdk deploy

Continuous Delivery and Deployment

Tools:

  • Git Repositories
  • AWS CodeBuild
  • AWS CloudFormation
  • AWS CodeDeploy
  • AWS CodePipeline

Deployment Strategies:

  • All at Once
  • Canary Deployment
  • Linear Rolling Deployment
  • Automatic Rollback

Retraining and Experimentation

Automatic Retraining:

  • Scheduled Retraining
  • Drift Detection
  • Incremental Learning
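
Drift detection can be as simple as comparing live feature statistics against a training-time baseline; managed services like SageMaker Model Monitor do this at scale, but the core idea fits in a toy sketch (data and threshold below are assumptions):

```python
import statistics

# Toy drift check: flag retraining when the live feature mean shifts
# more than a threshold measured in training-set standard deviations.
training = [10.0, 11.0, 9.0, 10.5, 9.5]   # baseline feature values
live = [13.0, 12.5, 14.0, 13.5, 12.0]     # recent production values

shift = abs(statistics.mean(live) - statistics.mean(training))
threshold = 2 * statistics.pstdev(training)   # assumed 2-sigma threshold
needs_retraining = shift > threshold
```

When `needs_retraining` fires, it would trigger the scheduled or automatic retraining pipeline described above.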

Experimentation and A/B Testing:

  • Test model variations to find the optimal configuration for performance.

Debugging and Explainability

SageMaker Debugger:

  • Improved Model Performance
  • Explainability and Bias Detection
  • Automated Monitoring and Alerts

Metrics to Monitor:

  • Data Quality Metrics: Missing Data, Duplicate Data, Data Drift
  • Model Bias Metrics: Differential Validity, Differential Prediction Bias
  • Model Explainability Metrics: SHAP (Shapley Additive Explanations), Feature Attribution

Conclusion

Building machine learning models on AWS requires careful consideration of data, model architecture, and deployment strategies. With services like Amazon SageMaker, AWS Glue, and AWS CDK, you can create robust, scalable, and well-optimized machine learning pipelines.

Additional Notes from the Orchestration and IaC Modules

SageMaker Pipelines Detail:

  • Integrates with SageMaker Feature Store, Amazon ECR, SageMaker Training, SageMaker Processing, and the Model Registry.
  • Pipeline steps: Processing -> Training -> Conditional -> CreateModel -> RegisterModel -> Fail

AWS CloudFormation Template Structure:

  1. FormatVersion
  2. Description
  3. Metadata
  4. Parameters
  5. Mappings
  6. Conditions
  7. Resources
  8. Outputs

```yaml
Resources:
  LogicalID:
    Type: "Resource Type"
    Properties:
      Key: Value
```

AWS CDK Components:

  • App: Entry point for AWS CDK apps
  • Stack: Represents a single unit of deployment
  • Construct: Reusable components (L1, L2, and L3 levels)

