Hi all, below is a study guide for those preparing for the AWS ML Associate certification.
I have gone through the AWS Skill Builder training, and these are the notes I handwrote along the way; writing them out helps me understand the concepts better.
Unfortunately, the formatting was lost when pasting into dev.to, so I used a tool to retain it, which hasn't worked as expected. If you need the raw notes, please message me and I will send them to you.
Key Questions at Preparatory Stages
- What data do you need to reach the intended output?
- Do you have access to that data? If so, how much and where is it?
- Which solution can you use to bring all this data into one centralized repository?
Useful Links:
- Amazon SageMaker Documentation
- AWS Responsible AI
- Large Language Models on AWS
- Difference Between Machine Learning Supervised and Unsupervised
Data Requirements
- Representative
- Relevant
- Feature Rich
- Consistent
Data Types:
- Text
- Tabular
- Time Series
- Image
Data Formats:
- Row-based formats: CSV, Avro, RecordIO
- Column-based formats: Parquet, ORC
- Object notation: JSON, JSONL
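To make the distinction concrete, here is a minimal pandas sketch (file names are placeholders, and `to_parquet` assumes pyarrow or fastparquet is installed) that converts a row-based CSV file into columnar Parquet and JSON Lines:

```python
# Convert a row-based CSV into columnar Parquet and object-notation JSONL.
# File names are placeholders; to_parquet needs pyarrow or fastparquet.
import pandas as pd

df = pd.read_csv("training_data.csv")                              # row-based source
df.to_parquet("training_data.parquet")                             # columnar output
df.to_json("training_data.jsonl", orient="records", lines=True)    # JSON Lines output
```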
Selecting Data Storage - Key Considerations:
- Throughput
- Latency
- Elasticity
- Scalability
- Data access patterns: Copy and Load, Sequential Streaming, Randomized Access
Storage Solutions:
- Amazon EBS: Best for training workloads with frequent random I/O access
- Amazon EFS: Suitable for real-time and streaming workloads requiring concurrent access
- Amazon S3: Ideal for large datasets that don't need quick access (e.g., pretrained ML models)
Data Ingestion:
Methods:
- Batch Ingestion
- Real-Time Ingestion
Real-Time Use Cases with Amazon Kinesis:
- Data Ingestion: Kinesis Data Streams for real-time data streaming
- Data Processing: Use Amazon Managed Service for Apache Flink for real-time processing
- Real-time Inference: Stream data to an Amazon SageMaker endpoint
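As an illustration of the ingestion step, here is a minimal boto3 sketch (the stream name, region, and record fields are assumptions, not part of the notes) that pushes one JSON record into a Kinesis Data Stream:

```python
# Push a single JSON record into a Kinesis Data Stream for downstream
# real-time processing or SageMaker inference. Names are placeholders.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

record = {"sensor_id": "s-001", "temperature": 71.3}
kinesis.put_record(
    StreamName="ml-ingest-stream",              # assumed stream name
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["sensor_id"],
)
```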
Tools for Data Transfer and Extraction:
- AWS CLI
- AWS SDK
- S3 Transfer Acceleration
- AWS DMS
- AWS Lambda
- AWS Glue
- AWS DataSync
- AWS Snowball
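For example, a boto3 sketch of an S3 upload that goes through an S3 Transfer Acceleration endpoint (the bucket and key are placeholders, and acceleration must already be enabled on the bucket):

```python
# Upload a local dataset to S3 via the Transfer Acceleration endpoint.
# Bucket and key are placeholders; the bucket must have acceleration enabled.
import boto3
from botocore.config import Config

s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3.upload_file("train.csv", "my-ml-datasets-bucket", "raw/train.csv")
```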
AWS Data Merging:
AWS Glue Workflow:
- Identify Data Sources
- Create an AWS Glue Crawler
- Generate ETL Scripts and Define Jobs
- Output Results
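A boto3 sketch of the crawler step in the workflow above (the role ARN, database name, and S3 path are placeholders):

```python
# Create and start a Glue crawler that catalogs raw data in S3.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="raw-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",     # assumed role ARN
    DatabaseName="ml_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-ml-datasets-bucket/raw/"}]},
)
glue.start_crawler(Name="raw-data-crawler")
```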
Amazon EMR Workflow:
- Ingest Streaming Data
- Distribute Across EMR Clusters
- Transform Data
- Output to Amazon S3
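The transform-and-output steps above might look like this in PySpark on an EMR cluster (the paths and the `event_time` column are assumptions for illustration):

```python
# Deduplicate streaming records and write curated Parquet back to S3.
# Intended to run on EMR, e.g. via spark-submit; paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("emr-transform").getOrCreate()

df = spark.read.json("s3://my-ml-datasets-bucket/streaming/")
clean = df.dropDuplicates().withColumn("event_date", F.to_date("event_time"))
clean.write.mode("overwrite").parquet("s3://my-ml-datasets-bucket/curated/")
```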
Data Transformation and Validation:
Data Transformation Techniques:
- Remove unnecessary data
- Use Columnar Storage
- Compress Data
- Format Splittable Data
Data Validation Checks:
- Data Integrity
- Accuracy
- Completeness
- Consistency
- Reliability
- Security
- Bias Detection
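A few of these checks are easy to sketch in pandas (the column names are illustrative):

```python
# Quick completeness, integrity, and consistency checks before training.
import pandas as pd

df = pd.read_parquet("training_data.parquet")

print(df.isnull().mean())                # completeness: share of missing values per column
print(df.duplicated().sum())             # integrity: number of duplicate rows
print(df["age"].between(0, 120).all())   # consistency: values fall in a plausible range
```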
Feature Engineering:
Categorical Encoding:
- Techniques: Label Encoding, One-Hot Encoding
Numeric Feature Engineering:
- Feature Scaling: Normalization (rescales values to the 0 to 1 range), Standardization (rescales to mean 0 and unit variance)
- Binning: Group numerical features into smaller bins
- Log Transformation
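A small scikit-learn/NumPy sketch of these techniques on toy data (the bin edges and columns are illustrative, and `sparse_output` assumes a recent scikit-learn):

```python
# Categorical encoding, scaling, binning, and log transformation on toy data.
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler

df = pd.DataFrame({"color": ["red", "blue", "red"], "price": [10.0, 250.0, 40.0]})

onehot = OneHotEncoder(sparse_output=False).fit_transform(df[["color"]])   # one-hot encoding
normalized = MinMaxScaler().fit_transform(df[["price"]])                   # normalization to 0-1
standardized = StandardScaler().fit_transform(df[["price"]])               # mean 0, unit variance
binned = pd.cut(df["price"], bins=[0, 50, 100, 500], labels=["low", "mid", "high"])
logged = np.log1p(df["price"])                                             # log transformation
```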
Model Training Concepts:
- Loss Functions: Root Mean Square Error (RMSE), Log-Likelihood Loss
- Optimization Techniques: Gradient Descent, Stochastic Gradient Descent
- Compute Environments: CPU Training Instances, GPU-accelerated Instances
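To tie the loss function and optimizer together, here is a minimal NumPy sketch of gradient descent minimizing mean squared error for a one-feature linear model (toy data):

```python
# Plain gradient descent on MSE for y ≈ w*x + b.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

w, b, lr = 0.0, 0.0, 0.01
for _ in range(1000):
    error = (w * X + b) - y
    w -= lr * 2 * np.mean(error * X)   # gradient of MSE with respect to w
    b -= lr * 2 * np.mean(error)       # gradient of MSE with respect to b

print(w, b)   # approaches roughly 2 and 0 for this toy data
```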
Model Deployment Steps:
- Version Control
- CI/CD Integration
- ML Model Numbering and Deployment
- Monitoring and Workflow Security
Workflow Orchestration Options:
- SageMaker Pipelines: Integrates with SageMaker Feature Store, Amazon ECR, SageMaker Training, SageMaker Processing, and the Model Registry
  - Typical step sequence: Processing -> Training -> Conditional -> CreateModel -> RegisterModel -> Fail
- AWS Step Functions
- AWS MWAA (Managed Workflows for Apache Airflow)
- Third-party Tools
Hyperparameter Tuning:
Techniques:
- Manual Selection
- Grid Search
- Random Search
- Bayesian Optimization
- SageMaker Automatic Model Tuning
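A sketch of SageMaker Automatic Model Tuning with the SageMaker Python SDK (not runnable without an AWS account; the role ARN, S3 URIs, and hyperparameter ranges are placeholders):

```python
# Bayesian hyperparameter tuning around a built-in XGBoost estimator.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

session = sagemaker.Session()
xgb = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, "1.7-1"),
    role="arn:aws:iam::123456789012:role/SageMakerRole",    # assumed role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-ml-bucket/output/",
)
xgb.set_hyperparameters(objective="reg:squarederror", num_round=100)

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({"train": "s3://my-ml-bucket/train/", "validation": "s3://my-ml-bucket/val/"})
```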
Conclusion:
Using AWS services like SageMaker, Glue, and Kinesis, you can create scalable, efficient machine learning workflows. Leverage the right combination of data storage, processing, and model tuning to meet your business goals effectively.
Model Size and Performance Optimization
Factors Affecting Model Size:
- Input Data and Features
- Architecture
- Problem Domain
- Desired Accuracy
Model Size Reduction Techniques:
- Compression
- Pruning
- Quantization
- Knowledge Distillation
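As an example of one of these techniques, here is a post-training dynamic quantization sketch in PyTorch (the model itself is a stand-in, not from the notes):

```python
# Quantize the Linear layers of a small model to int8 weights.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)   # Linear layers are replaced by dynamically quantized versions
```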
Model Performance and Evaluation
Model Evaluation Techniques:
- Using hold-out data you set aside: a validation split and a test split
- Establishing Evaluation Metrics
- Assessing Trade-offs (performance, training time, and cost)
Common Metrics for Classification Problems:
- Accuracy: \( \frac{TP + TN}{TP + TN + FP + FN} \)
- Precision: \( \frac{TP}{TP + FP} \) – Best when the cost of false positives is high
- Recall: \( \frac{TP}{TP + FN} \) – Best when the cost of false negatives is high
- F1-Score: Harmonic mean of Precision and Recall
- AUC-ROC: Area Under the Receiver Operating Characteristic Curve
Regression Problem Metrics:
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-Squared (R²)
- Adjusted R-Squared
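Both sets of metrics can be computed with scikit-learn; a short sketch on toy labels and predictions:

```python
# Classification and regression metrics on toy data.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score,
                             mean_squared_error, r2_score)

y_true, y_pred = [1, 0, 1, 1, 0], [1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1]        # predicted probabilities for AUC-ROC

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))

r_true, r_pred = [3.0, 5.0, 7.5], [2.8, 5.3, 7.1]
mse = mean_squared_error(r_true, r_pred)
print(mse, mse ** 0.5, r2_score(r_true, r_pred))   # MSE, RMSE, R-squared
```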
Model Deployment and Monitoring
Model Deployment Targets:
- Amazon SageMaker Endpoints
- Amazon EKS
- Amazon ECS
- AWS Lambda
Model Inference Options:
- Real-time
- Serverless
- Asynchronous
- Batch Transform
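For the real-time option, invoking an already-deployed SageMaker endpoint with boto3 might look like this (the endpoint name and payload format are assumptions):

```python
# Real-time inference against a deployed SageMaker endpoint.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="churn-model-endpoint",   # assumed endpoint name
    ContentType="text/csv",
    Body=b"42,0,1,199.5",                  # one feature row in the model's expected format
)
print(response["Body"].read().decode("utf-8"))
```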
Infrastructure as Code (IaC) for ML Workflows
Benefits of IaC:
- Code Management
- Repeatable Provisioning
- Testing and Experimentation
Tools:
- AWS CloudFormation
- AWS CDK (Cloud Development Kit)
- Terraform
- Pulumi
AWS CloudFormation Template Structure:
- FormatVersion
- Description
- Metadata
- Parameters
- Mappings
- Conditions
- Resources
- Outputs

Each entry in the Resources section follows this pattern:

    Resources:
      LogicalID:
        Type: "Resource Type"
        Properties:
          Key: Value
AWS CDK Components:
- App: Entry point for AWS CDK apps
- Stack: Represents a single unit of deployment
- Construct: Reusable components (L1, L2, L3)
CDK Deployment Workflow:
- `cdk init <app>`
- `cdk bootstrap`
- `cdk synth`
- `cdk deploy`
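The workflow above operates on an app such as this minimal CDK v2 sketch in Python (the stack and bucket are illustrative):

```python
# Minimal CDK v2 app: one stack containing a versioned S3 bucket
# for model artifacts. `cdk synth` / `cdk deploy` act on this app.
from aws_cdk import App, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class MlArtifactsStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        s3.Bucket(self, "ModelArtifactsBucket", versioned=True)

app = App()
MlArtifactsStack(app, "MlArtifactsStack")
app.synth()
```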
Continuous Delivery and Deployment
Tools:
- Git Repositories
- AWS CodeBuild
- AWS CloudFormation
- AWS CodeDeploy
- AWS CodePipeline
Deployment Strategies:
- All at Once
- Canary Deployment
- Linear Rolling Deployment
- Automatic Rollback
Retraining and Experimentation
Automatic Retraining:
- Scheduled Retraining
- Drift Detection
- Incremental Learning
Experimentation and A/B Testing:
- Test model variations to find the optimal configuration for performance.
Debugging and Explainability
SageMaker Debugger:
- Improved Model Performance
- Explainability and Bias Detection
- Automated Monitoring and Alerts
Metrics to Monitor:
- Data Quality Metrics: Missing Data, Duplicate Data, Data Drift
- Model Bias Metrics: Differential Validity, Differential Prediction Bias
- Model Explainability Metrics: SHAP (Shapley Additive Explanations), Feature Attribution
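A small sketch of SHAP feature attribution outside SageMaker, using the open-source shap package on a scikit-learn model (the dataset and model are stand-ins):

```python
# Per-feature SHAP attributions for a tree-based classifier.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:100])   # attribution of each feature per sample
```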
Conclusion
Building machine learning models on AWS requires careful consideration of data, model architecture, and deployment strategies. With services like Amazon SageMaker, AWS Glue, and AWS CDK, you can create robust, scalable, and well-optimized machine learning pipelines.