Hi all, below is a study guide for those preparing for the AWS ML Associate certification.
I have gone through the AWS Skill Builder training, and these are the notes I handwrote along the way; writing them out helps me understand the concepts better.
Unfortunately, the formatting was lost when pasting into dev.to, so I used a tool to retain it, which hasn't worked as expected. If you need the raw notes, please message me and I will send them to you.
Key Questions at Preparatory Stages
- What data do you need to reach the intended output?
- Do you have access to that data? If so, how much and where is it?
- Which solution can you use to bring all this data into one centralized repository?
Useful Links:
- Amazon SageMaker Documentation
- AWS Responsible AI
- Large Language Models on AWS
- Difference Between Machine Learning Supervised and Unsupervised
Data Requirements
- Representative
- Relevant
- Feature Rich
- Consistent
Data Types:
- Text
- Tabular
- Time Series
- Image
Data Formats:
- Row-based formats: CSV, Avro, RecordIO
- Column-based formats: Parquet, ORC
- Object notation: JSON, JSONL
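To make the distinction concrete, here is a minimal pandas sketch (file names are placeholders, and `to_parquet` assumes pyarrow or fastparquet is installed) that converts a row-based CSV file into columnar Parquet and JSON Lines:

```python
# Convert a row-based CSV into columnar Parquet and object-notation JSONL.
# File names are placeholders; to_parquet needs pyarrow or fastparquet.
import pandas as pd

df = pd.read_csv("training_data.csv")                              # row-based source
df.to_parquet("training_data.parquet")                             # columnar output
df.to_json("training_data.jsonl", orient="records", lines=True)    # JSON Lines output
```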
Selecting Data Storage - Key Considerations:
- Throughput
- Latency
- Elasticity
- Scalability
- Data access patterns: Copy and Load, Sequential Streaming, Randomized Access
Storage Solutions:
- Amazon EBS: Best for training workloads with frequent random I/O access
- Amazon EFS: Suitable for real-time and streaming workloads requiring concurrent access
- Amazon S3: Ideal for large datasets that don't need quick access (e.g., pretrained ML models)
Data Ingestion:
Methods:
- Batch Ingestion
- Real-Time Ingestion
Real-Time Use Cases with Amazon Kinesis:
- Data Ingestion: Kinesis Data Streams for real-time data streaming
- Data Processing: Use Amazon Managed Service for Apache Flink for real-time processing
- Real-time Inference: Stream data to an Amazon SageMaker endpoint
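As an illustration of the ingestion step, here is a minimal boto3 sketch (the stream name, region, and record fields are assumptions, not part of the notes) that pushes one JSON record into a Kinesis Data Stream:

```python
# Push a single JSON record into a Kinesis Data Stream for downstream
# real-time processing or SageMaker inference. Names are placeholders.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

record = {"sensor_id": "s-001", "temperature": 71.3}
kinesis.put_record(
    StreamName="ml-ingest-stream",              # assumed stream name
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["sensor_id"],
)
```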
Tools for Data Transfer and Extraction:
- AWS CLI
- AWS SDK
- S3 Transfer Acceleration
- AWS DMS
- AWS Lambda
- AWS Glue
- AWS DataSync
- AWS Snowball
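For example, a boto3 sketch of an S3 upload that goes through an S3 Transfer Acceleration endpoint (the bucket and key are placeholders, and acceleration must already be enabled on the bucket):

```python
# Upload a local dataset to S3 via the Transfer Acceleration endpoint.
# Bucket and key are placeholders; the bucket must have acceleration enabled.
import boto3
from botocore.config import Config

s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3.upload_file("train.csv", "my-ml-datasets-bucket", "raw/train.csv")
```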
AWS Data Merging:
AWS Glue Workflow:
- Identify Data Sources
- Create an AWS Glue Crawler
- Generate ETL Scripts and Define Jobs
- Output Results
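A boto3 sketch of the crawler step in the workflow above (the role ARN, database name, and S3 path are placeholders):

```python
# Create and start a Glue crawler that catalogs raw data in S3.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="raw-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",     # assumed role ARN
    DatabaseName="ml_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-ml-datasets-bucket/raw/"}]},
)
glue.start_crawler(Name="raw-data-crawler")
```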
Amazon EMR Workflow:
- Ingest Streaming Data
- Distribute Across EMR Clusters
- Transform Data
- Output to Amazon S3
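The transform-and-output steps above might look like this in PySpark on an EMR cluster (the paths and the `event_time` column are assumptions for illustration):

```python
# Deduplicate streaming records and write curated Parquet back to S3.
# Intended to run on EMR, e.g. via spark-submit; paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("emr-transform").getOrCreate()

df = spark.read.json("s3://my-ml-datasets-bucket/streaming/")
clean = df.dropDuplicates().withColumn("event_date", F.to_date("event_time"))
clean.write.mode("overwrite").parquet("s3://my-ml-datasets-bucket/curated/")
```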
Data Transformation and Validation:
Data Transformation Techniques:
- Remove unnecessary data
- Use Columnar Storage
- Compress Data
- Format Splittable Data
Data Validation Checks:
- Data Integrity
- Accuracy
- Completeness
- Consistency
- Reliability
- Security
- Bias Detection
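A few of these checks are easy to sketch in pandas (the column names are illustrative):

```python
# Quick completeness, integrity, and consistency checks before training.
import pandas as pd

df = pd.read_parquet("training_data.parquet")

print(df.isnull().mean())                # completeness: share of missing values per column
print(df.duplicated().sum())             # integrity: number of duplicate rows
print(df["age"].between(0, 120).all())   # consistency: values fall in a plausible range
```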
Feature Engineering:
Categorical Encoding:
- Techniques: Label Encoding, One-Hot Encoding
Numeric Feature Engineering:
- Feature Scaling: Normalization (rescales values to the 0 to 1 range), Standardization (rescales to mean 0 and unit variance)
- Binning: Group numerical features into smaller bins
- Log Transformation
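A small scikit-learn/NumPy sketch of these techniques on toy data (the bin edges and columns are illustrative, and `sparse_output` assumes a recent scikit-learn):

```python
# Categorical encoding, scaling, binning, and log transformation on toy data.
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler

df = pd.DataFrame({"color": ["red", "blue", "red"], "price": [10.0, 250.0, 40.0]})

onehot = OneHotEncoder(sparse_output=False).fit_transform(df[["color"]])   # one-hot encoding
normalized = MinMaxScaler().fit_transform(df[["price"]])                   # normalization to 0-1
standardized = StandardScaler().fit_transform(df[["price"]])               # mean 0, unit variance
binned = pd.cut(df["price"], bins=[0, 50, 100, 500], labels=["low", "mid", "high"])
logged = np.log1p(df["price"])                                             # log transformation
```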
Model Training Concepts:
- Loss Functions: Root Mean Square Error (RMSE), Log-Likelihood Loss
- Optimization Techniques: Gradient Descent, Stochastic Gradient Descent
- Compute Environments: CPU Training Instances, GPU-accelerated Instances
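To tie the loss function and optimizer together, here is a minimal NumPy sketch of gradient descent minimizing mean squared error for a one-feature linear model (toy data):

```python
# Plain gradient descent on MSE for y ≈ w*x + b.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

w, b, lr = 0.0, 0.0, 0.01
for _ in range(1000):
    error = (w * X + b) - y
    w -= lr * 2 * np.mean(error * X)   # gradient of MSE with respect to w
    b -= lr * 2 * np.mean(error)       # gradient of MSE with respect to b

print(w, b)   # approaches roughly 2 and 0 for this toy data
```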
Model Deployment Steps:
- Version Control
- CI/CD Integration
- ML Model Numbering and Deployment
- Monitoring and Workflow Security
Workflow Orchestration Options:
- SageMaker Pipelines: Integrates with SageMaker Feature Store, Amazon ECR, SageMaker Training, SageMaker Processing, and the Model Registry
  - Typical step sequence: Processing -> Training -> Conditional -> CreateModel -> RegisterModel -> Fail
- AWS Step Functions
- AWS MWAA (Managed Workflows for Apache Airflow)
- Third-party Tools
Hyperparameter Tuning:
Techniques:
- Manual Selection
- Grid Search
- Random Search
- Bayesian Optimization
- SageMaker Automatic Model Tuning
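A sketch of SageMaker Automatic Model Tuning with the SageMaker Python SDK (not runnable without an AWS account; the role ARN, S3 URIs, and hyperparameter ranges are placeholders):

```python
# Bayesian hyperparameter tuning around a built-in XGBoost estimator.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

session = sagemaker.Session()
xgb = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, "1.7-1"),
    role="arn:aws:iam::123456789012:role/SageMakerRole",    # assumed role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-ml-bucket/output/",
)
xgb.set_hyperparameters(objective="reg:squarederror", num_round=100)

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({"train": "s3://my-ml-bucket/train/", "validation": "s3://my-ml-bucket/val/"})
```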
Conclusion:
Using AWS services like SageMaker, Glue, and Kinesis, you can create scalable, efficient machine learning workflows. Leverage the right combination of data storage, processing, and model tuning to meet your business goals effectively.
Model Size and Performance Optimization
Factors Affecting Model Size:
- Input Data and Features
- Architecture
- Problem Domain
- Desired Accuracy
Model Size Reduction Techniques:
- Compression
- Pruning
- Quantization
- Knowledge Distillation
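As an example of one of these techniques, here is a post-training dynamic quantization sketch in PyTorch (the model itself is a stand-in, not from the notes):

```python
# Quantize the Linear layers of a small model to int8 weights.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)   # Linear layers are replaced by dynamically quantized versions
```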
Model Performance and Evaluation
Model Evaluation Techniques:
- Using hold-out data you set aside: a validation split and a test split
- Establishing Evaluation Metrics
- Assessing Trade-offs (performance, training time, and cost)
Common Metrics for Classification Problems:
- Accuracy: \( \frac{TP + TN}{TP + TN + FP + FN} \)
- Precision: \( \frac{TP}{TP + FP} \) – Best when the cost of false positives is high
- Recall: \( \frac{TP}{TP + FN} \) – Best when the cost of false negatives is high
- F1-Score: Harmonic mean of Precision and Recall
- AUC-ROC: Area Under the Receiver Operating Characteristic Curve
Regression Problem Metrics:
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-Squared (R²)
- Adjusted R-Squared
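Both sets of metrics can be computed with scikit-learn; a short sketch on toy labels and predictions:

```python
# Classification and regression metrics on toy data.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score,
                             mean_squared_error, r2_score)

y_true, y_pred = [1, 0, 1, 1, 0], [1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1]        # predicted probabilities for AUC-ROC

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))

r_true, r_pred = [3.0, 5.0, 7.5], [2.8, 5.3, 7.1]
mse = mean_squared_error(r_true, r_pred)
print(mse, mse ** 0.5, r2_score(r_true, r_pred))   # MSE, RMSE, R-squared
```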
Model Deployment and Monitoring
Model Deployment Targets:
- Amazon SageMaker Endpoints
- Amazon EKS
- Amazon ECS
- AWS Lambda
Model Inference Options:
- Real-time
- Serverless
- Asynchronous
- Batch Transform
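For the real-time option, invoking an already-deployed SageMaker endpoint with boto3 might look like this (the endpoint name and payload format are assumptions):

```python
# Real-time inference against a deployed SageMaker endpoint.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="churn-model-endpoint",   # assumed endpoint name
    ContentType="text/csv",
    Body=b"42,0,1,199.5",                  # one feature row in the model's expected format
)
print(response["Body"].read().decode("utf-8"))
```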
Infrastructure as Code (IaC) for ML Workflows
Benefits of IaC:
- Code Management
- Repeatable Provisioning
- Testing and Experimentation
Tools:
- AWS CloudFormation
- AWS CDK (Cloud Development Kit)
- Terraform
- Pulumi
AWS CloudFormation Template Structure:
- FormatVersion
- Description
- Metadata
- Parameters
- Mappings
- Conditions
- Resources
- Outputs

Each entry in the Resources section follows this pattern:

    Resources:
      LogicalID:
        Type: "Resource Type"
        Properties:
          Key: Value
AWS CDK Components:
- App: Entry point for AWS CDK apps
- Stack: Represents a single unit of deployment
- Construct: Reusable components (L1, L2, L3)
CDK Deployment Workflow:
- `cdk init <app>`
- `cdk bootstrap`
- `cdk synth`
- `cdk deploy`
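The workflow above operates on an app such as this minimal CDK v2 sketch in Python (the stack and bucket are illustrative):

```python
# Minimal CDK v2 app: one stack containing a versioned S3 bucket
# for model artifacts. `cdk synth` / `cdk deploy` act on this app.
from aws_cdk import App, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class MlArtifactsStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        s3.Bucket(self, "ModelArtifactsBucket", versioned=True)

app = App()
MlArtifactsStack(app, "MlArtifactsStack")
app.synth()
```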
Continuous Delivery and Deployment
Tools:
- Git Repositories
- AWS CodeBuild
- AWS CloudFormation
- AWS CodeDeploy
- AWS CodePipeline
Deployment Strategies:
- All at Once
- Canary Deployment
- Linear Rolling Deployment
- Automatic Rollback
Retraining and Experimentation
Automatic Retraining:
- Scheduled Retraining
- Drift Detection
- Incremental Learning
Experimentation and A/B Testing:
- Test model variations to find the optimal configuration for performance.
Debugging and Explainability
SageMaker Debugger:
- Improved Model Performance
- Explainability and Bias Detection
- Automated Monitoring and Alerts
Metrics to Monitor:
- Data Quality Metrics: Missing Data, Duplicate Data, Data Drift
- Model Bias Metrics: Differential Validity, Differential Prediction Bias
- Model Explainability Metrics: SHAP (Shapley Additive Explanations), Feature Attribution
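A small sketch of SHAP feature attribution outside SageMaker, using the open-source shap package on a scikit-learn model (the dataset and model are stand-ins):

```python
# Per-feature SHAP attributions for a tree-based classifier.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:100])   # attribution of each feature per sample
```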
Conclusion
Building machine learning models on AWS requires careful consideration of data, model architecture, and deployment strategies. With services like Amazon SageMaker, AWS Glue, and AWS CDK, you can create robust, scalable, and well-optimized machine learning pipelines.