LLMs in Amazon Bedrock: Observability Maturity Model

Indika_Wimalasuriya - Apr 27 - Dev Community

A few weeks back, I presented at Conf42 LLM2024. My presentation was on "LLMs in AWS: Observability Maturity from Foundations to AIOps." This blog post covers the technical elements I discussed as part of my presentation.

Disclaimer: LLM observability comes in two forms:

  1. Direct LLM observability, which means integrating observability capabilities into the LLM itself during training, deployment, and maintenance. This allows us to gain insight into the LLM's overall performance, detect anomalies or performance issues, and understand its decision-making process.

  2. Indirect LLM observability, which means integrating observability capabilities into the GenAI application instead of the LLM itself. Here, the focus is mainly on understanding the internal state of the GenAI app while indirectly keeping an eye on the LLM.

In my Observability Maturity Model for LLMs, I mainly focus on indirect LLM observability. As the title says, this is Amazon Bedrock-based, so you don't have access to the underlying foundation models; you simply access them through APIs. Even if you wanted to, there is no option for direct LLM observability here.

The key areas I'm focusing on for indirect LLM observability are listed below; a minimal logging sketch follows the list:

  • Logging inputs and outputs to/from the LLM
  • Monitoring application performance metrics
  • Implementing anomaly detection on LLM outputs
  • Enabling human feedback loops to assess LLM output quality
  • Tracking application-level metrics like error rates and latency
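
To make the first two bullets concrete, here is a minimal Python/boto3 sketch (my language choice, not something the post prescribes) that wraps a Bedrock invocation so the prompt, output, latency, and errors are all logged. The logger name and log fields are illustrative assumptions, and the request body format varies by model family.

```python
import json
import logging
import time
import uuid

import boto3

logger = logging.getLogger("genai.observability")  # illustrative logger name
logging.basicConfig(level=logging.INFO)

bedrock_runtime = boto3.client("bedrock-runtime")  # Bedrock data-plane client


def invoke_with_logging(model_id: str, body: dict) -> dict:
    """Invoke a Bedrock model and log input, output, latency, and errors."""
    request_id = str(uuid.uuid4())
    started = time.perf_counter()
    try:
        response = bedrock_runtime.invoke_model(modelId=model_id, body=json.dumps(body))
        payload = json.loads(response["body"].read())
        logger.info(json.dumps({
            "request_id": request_id,
            "model_id": model_id,
            "prompt": body,                      # kept for quality review and feedback loops
            "output": payload,
            "latency_ms": round((time.perf_counter() - started) * 1000, 1),
            "status": "success",
        }))
        return payload
    except Exception as exc:
        # Application-level error tracking feeds the error-rate metric.
        logger.error(json.dumps({
            "request_id": request_id,
            "model_id": model_id,
            "error": str(exc),
            "latency_ms": round((time.perf_counter() - started) * 1000, 1),
            "status": "error",
        }))
        raise
```

Because the records are structured JSON, they can later feed dashboards, anomaly detection, and human feedback review without reprocessing.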

My main objectives in implementing observability for LLM-based GenAI apps this way are:

  • Gain insight into LLM usage within the application
  • Observe the application's behavior and performance when interacting with the LLM
  • Ensure reliability of the application built around the LLM

Let's cut to the chase and look at my Observability Maturity Model for GenAI apps developed using Amazon Bedrock.

Observability Maturity Model for GenAI Apps

Now, let's cover a few details of each level I have come up with here:

Level 1: Foundation LLM Observability

At the foundation level, I'm focusing on providing the basics and keeping the lights on. Here you establish logging, monitoring, and visualization for prompt properties, LLM events, and outputs, and implement distributed tracing and basic security measures so the system can be managed effectively and stays compliant. A sketch of enabling Bedrock invocation logging follows the list.

  • Prompt Properties: Log and monitor basic prompt properties such as prompt content and prompt source to track variations and effectiveness.
  • Logs: Implement basic logging for LLM-related events and outputs to maintain a record of system activities and troubleshoot issues effectively.
  • Distributed Tracing: Implement distributed tracing to understand the flow of requests through your LLM-powered application, aiding in performance optimization and debugging.
  • Visualizations: Set up basic visualizations and dashboards for LLM metrics and logs to gain insights into model behavior and performance trends.
  • Alert and Incident Management: Establish basic alert and incident management processes for LLM-related issues to ensure timely detection and resolution of anomalies or failures.
  • Security Compliance: Implement basic monitoring of LLM usage and cost controls to ensure compliance with security standards and prevent unauthorized access or misuse.
  • Cost Optimization: Implement basic cost optimization strategies for LLM inference to manage expenses efficiently and maximize the value derived from model deployment.
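
As a starting point for the logging bullet above, here is a sketch of switching on Amazon Bedrock's account-level model invocation logging with boto3. The log group name and IAM role ARN are placeholders you must create beforehand.

```python
import boto3

bedrock = boto3.client("bedrock")  # Bedrock control-plane client

# Deliver full request/response text for every model invocation to CloudWatch Logs.
bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "cloudWatchConfig": {
            "logGroupName": "/genai/bedrock/invocation-logs",                # placeholder
            "roleArn": "arn:aws:iam::123456789012:role/BedrockLoggingRole",  # placeholder
        },
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }
)
```

The role must allow Bedrock to write to the log group, and an S3 destination can be configured instead of (or alongside) CloudWatch Logs if long-term retention is cheaper there.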

Level 2: Proactive LLM Observability

At this level, I'm focusing on elevating the foundation, delving deeper into metrics and bringing in advanced insights so that issues can be identified proactively and resolved promptly. Deploy comprehensive capabilities including advanced LLM metric analysis, optimization of prompt properties, enhanced alert management, robust security compliance, advanced cost optimization, AWS forecasting services, and anomaly detection for both metrics and logs. A sketch of metric anomaly detection on Bedrock's built-in metrics follows the list.

  • LLM Metrics: Capture and analyze advanced LLM metrics, including model performance and output quality, for deeper insights.
  • Prompt Properties: Log and monitor advanced prompt properties such as prompt performance and prompt versioning to optimize model inputs.
  • Alert and Incident Management: Enhance alert and incident management processes specifically tailored for LLM-related issues to ensure proactive response and resolution.
  • Security Compliance: Ensure LLM security compliance and implement robust monitoring mechanisms to safeguard against potential threats.
  • Cost Optimization: Implement advanced strategies for cost optimization related to LLM inference, maximizing efficiency and resource utilization.
  • AWS Forecasting: Utilize AWS forecasting services to predict future LLM usage and associated costs, enabling better resource planning.
  • Metric Anomaly Detection: Set up metric anomaly detection mechanisms to identify unusual patterns or deviations in LLM metrics for early anomaly detection.
  • Log Anomaly Detection: Implement log anomaly detection techniques to detect abnormal patterns in LLM logs, facilitating proactive troubleshooting and problem resolution.
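
For the metric anomaly detection bullet, here is a sketch using CloudWatch anomaly detection on Bedrock's built-in InvocationLatency metric. The model ID, band width, and alarm settings are assumptions to adapt, and you should confirm the metric names your account actually emits under the AWS/Bedrock namespace.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

dimensions = [{"Name": "ModelId", "Value": "anthropic.claude-3-sonnet-20240229-v1:0"}]  # assumed model

# Train an anomaly detection model on average invocation latency.
cloudwatch.put_anomaly_detector(
    SingleMetricAnomalyDetector={
        "Namespace": "AWS/Bedrock",
        "MetricName": "InvocationLatency",
        "Dimensions": dimensions,
        "Stat": "Average",
    }
)

# Alarm when latency leaves the expected band (band width of 2 standard deviations).
cloudwatch.put_metric_alarm(
    AlarmName="bedrock-invocation-latency-anomaly",
    ComparisonOperator="GreaterThanUpperThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="band",
    Metrics=[
        {
            "Id": "latency",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Bedrock",
                    "MetricName": "InvocationLatency",
                    "Dimensions": dimensions,
                },
                "Period": 300,
                "Stat": "Average",
            },
        },
        {"Id": "band", "Expression": "ANOMALY_DETECTION_BAND(latency, 2)"},
    ],
    ActionsEnabled=False,  # attach SNS actions as part of alert and incident management
)
```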

Level 3: Advanced LLM Observability with AIOps

Finally, it's about advanced LLM observability with AIOps. Here, I'm focusing on bringing in AI capabilities such as anomaly detection, forecasting, and noise reduction, applying AI/ML to your observability pillars so you can eliminate issues before they materialize. At this final level, you capture and analyze sophisticated LLM metrics, optimize prompt properties, enhance alert management, ensure security compliance, implement cost optimization strategies, leverage AWS forecasting, and set up anomaly detection mechanisms for proactive issue resolution.
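
Since log anomaly detection is one of the AIOps capabilities at this level, here is a sketch that points a CloudWatch Logs anomaly detector at the Bedrock invocation log group from Level 1. It assumes a boto3 version recent enough to include the log anomaly detector API; the region, log group name, and detector name are illustrative.

```python
import boto3

logs = boto3.client("logs")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# ARN of the invocation log group configured at Level 1 (region and name are placeholders).
log_group_arn = f"arn:aws:logs:us-east-1:{account_id}:log-group:/genai/bedrock/invocation-logs"

logs.create_log_anomaly_detector(
    logGroupArnList=[log_group_arn],
    detectorName="bedrock-invocation-log-anomalies",
    evaluationFrequency="FIFTEEN_MIN",  # how often log patterns are re-evaluated
    anomalyVisibilityTime=14,           # days a detected anomaly remains visible
)
```

Detected anomalies can then be surfaced through the same alert and incident management path as the metric alarms. The focus areas for this level are: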

  • LLM Metrics: Capture and analyze advanced LLM metrics, including model performance and output quality, for deeper insights.
  • Prompt Properties: Log and monitor advanced prompt properties, such as prompt performance and prompt versioning, to optimize model inputs.
  • Alert and Incident Management: Enhance alert and incident management processes specifically tailored for LLM-related issues to ensure proactive response and resolution.
  • Security Compliance: Ensure LLM security compliance and implement robust monitoring mechanisms to safeguard against potential threats.
  • Cost Optimization: Implement advanced cost optimization strategies for LLM inference, maximizing efficiency and resource utilization.
  • AWS Forecasting: Leverage AWS forecasting services to predict future LLM usage and associated costs, enabling better resource planning.
  • Metric Anomaly Detection: Set up metric anomaly detection mechanisms to identify unusual patterns or deviations in LLM metrics for early anomaly detection.
  • Log Anomaly Detection: Implement log anomaly detection techniques to detect abnormal patterns in LLM logs, facilitating proactive troubleshooting and problem resolution.

A few things I want to call out here are:

LLM Specific Metrics

I have given focused attention to identifying the key LLM-specific metrics. It's important to cover these as part of your implementation; treat the list below as a checklist. A sketch of publishing one of these as a custom CloudWatch metric follows it.

  • LLM Metrics: Capture and analyze advanced LLM metrics (e.g., model performance, output quality).
  • Prompt Properties: Log and monitor advanced prompt properties (e.g., prompt performance, prompt versioning).
  • Alert and Incident Management: Enhance alert and incident management processes for LLM-related issues.
  • Security Compliance: Ensure LLM security compliance and monitoring.
  • Cost Optimization: Implement advanced cost optimization strategies for LLM inference.
  • AWS Forecasting: Leverage AWS forecasting services to predict future LLM usage and costs.
  • Metric Anomaly Detection: Set up metric anomaly detection for LLM metrics.
  • Log Anomaly Detection: Implement log anomaly detection to identify unusual patterns in LLM logs.
  • LLM Model Drift: Monitor the distribution of the LLM's output within the Bedrock application over time, identify significant changes in the output distribution that may indicate model drift, and track the LLM's performance on a held-out evaluation set or benchmark tasks specific to the Bedrock application.
  • LLM Cost Optimization: Monitor the cost associated with LLM inference requests within the Bedrock application, track the cost per inference and the total cost over time, and identify opportunities for cost optimization such as caching or batching inference requests.
  • LLM Integration Errors: Monitor and log any errors or exceptions that occur during the integration of the LLM with the Bedrock application, track their frequency and severity, and troubleshoot issues related to the integration code or the communication between the Bedrock application and the LLM service.
  • LLM Ethical Considerations: Monitor the LLM's output within the Bedrock application for potential ethical risks or violations, track instances of harmful, illegal, or discriminatory content generated by the LLM, and ensure that the output aligns with established ethical principles and guidelines for responsible AI development and deployment.
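
Some checklist items, such as model drift or cost per inference, have no built-in Bedrock metric, so you publish them yourself. Here is a minimal sketch for cost per inference as a custom CloudWatch metric; the namespace, dimension, and per-1K-token prices are illustrative assumptions, not Bedrock-provided values.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Assumed per-1K-token prices; substitute the pricing of the model you actually use.
INPUT_PRICE_PER_1K = 0.003
OUTPUT_PRICE_PER_1K = 0.015


def publish_inference_cost(model_id: str, input_tokens: int, output_tokens: int) -> None:
    """Estimate the cost of one invocation and publish it as a custom metric."""
    cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    cloudwatch.put_metric_data(
        Namespace="GenAI/Bedrock",  # custom namespace (assumption)
        MetricData=[{
            "MetricName": "CostPerInference",
            "Dimensions": [{"Name": "ModelId", "Value": model_id}],
            "Value": cost,
            "Unit": "None",
        }],
    )
```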

Prompt Engineering Properties

Then, it's all about continuously improving your prompts. To do that, keep track of the following aspects (a sketch of logging them per invocation follows the list):

  • Temperature: Controls randomness in model output. Higher temperatures yield more diverse responses; lower temperatures, more focused.
  • Top-p Sampling: Controls output diversity by considering only the most probable tokens.
  • Top-k Sampling: Considers only the k most probable tokens for generating the next token.
  • Max Token Length: Sets the maximum length of generated text.
  • Stop Tokens: Signals the model to stop generating text when encountered.
  • Repetition Penalty: Penalizes the model for repeating text, encouraging diversity.
  • Presence Penalty: Penalizes tokens that have already appeared in the output, encouraging the model to introduce new content.
  • Batch Size: Determines the number of input sequences processed simultaneously.
  • Inference Latency: Time taken for the model to generate output given input.
  • Model Accuracy & Metrics: Task-specific metrics like accuracy, perplexity, or BLEU score.
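
To compare prompt experiments over time, it helps to record these properties as a structured log entry alongside every invocation. Here is a minimal sketch; the field names and the prompt-version label are illustrative assumptions.

```python
import json
import logging
import time

logger = logging.getLogger("genai.prompt-properties")  # illustrative logger name
logging.basicConfig(level=logging.INFO)


def log_prompt_properties(model_id: str, prompt_version: str,
                          params: dict, latency_ms: float) -> None:
    """Emit one structured record per invocation so prompt experiments can be compared."""
    logger.info(json.dumps({
        "timestamp": time.time(),
        "model_id": model_id,
        "prompt_version": prompt_version,        # e.g. "order-summary-v3" (hypothetical label)
        "temperature": params.get("temperature"),
        "top_p": params.get("top_p"),
        "top_k": params.get("top_k"),
        "max_tokens": params.get("max_tokens"),
        "stop_sequences": params.get("stop_sequences"),
        "latency_ms": latency_ms,
    }))
```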

Performance Metrics, Logging, and Tracing for RAG Models (Retrieval-Augmented Generation)

Finally, it's about what you do with RAG. Monitor the performance and behavior of the RAG model through metrics such as query latency and success rate, logs such as query and error logs, and end-to-end tracing, so you can ensure optimal functionality and troubleshoot issues effectively. A sketch of emitting the basic RAG metrics follows the list.

Metrics:
  • Query latency: Track the time it takes for the RAG model to process a query and generate a response. This can help identify performance bottlenecks.
  • Success rate: Monitor the percentage of successful queries versus failed queries. This can indicate issues with the model or the underlying infrastructure.
  • Resource utilization: Monitor the CPU, memory, and network usage of the RAG model and the associated services. This can help with capacity planning and identifying resource constraints.
  • Cache hit rate: If you're using a cache for retrieved documents, monitor the cache hit rate to understand its effectiveness.

Logs:
  • Query logs: Log the input queries, retrieved documents, and generated responses. This can aid in debugging and understanding the model's behavior.
  • Error logs: Log any errors or exceptions that occur during query processing or document retrieval. This can help identify and troubleshoot issues.
  • Audit logs: Log user interactions, authentication events, and any sensitive operations for security and compliance purposes.

Tracers:
  • End-to-end tracing: Implement distributed tracing to track the flow of a query through the various components of the RAG model, including document retrieval, encoding, and generation.
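
Here is a minimal sketch of the metrics portion: time the retrieval and generation steps and publish query latency and success as custom CloudWatch metrics. retrieve_documents and generate_answer are hypothetical placeholders for your own RAG components, and the GenAI/RAG namespace is an assumption.

```python
import time

import boto3

cloudwatch = boto3.client("cloudwatch")


def retrieve_documents(query: str) -> list:
    """Placeholder retriever; replace with your vector-store or knowledge-base lookup."""
    return []


def generate_answer(query: str, documents: list) -> str:
    """Placeholder generator; replace with a Bedrock invoke_model call."""
    return ""


def answer_with_metrics(query: str) -> str:
    """Answer a query and publish QueryLatency and QuerySuccess metrics."""
    started = time.perf_counter()
    success = 0.0
    try:
        documents = retrieve_documents(query)
        answer = generate_answer(query, documents)
        success = 1.0
        return answer
    finally:
        latency_ms = (time.perf_counter() - started) * 1000
        cloudwatch.put_metric_data(
            Namespace="GenAI/RAG",  # custom namespace (assumption)
            MetricData=[
                {"MetricName": "QueryLatency", "Value": latency_ms, "Unit": "Milliseconds"},
                {"MetricName": "QuerySuccess", "Value": success, "Unit": "Count"},
            ],
        )
```

Averaging QuerySuccess over a period gives the success rate, and per-step timers around retrieval and generation (or X-Ray/OpenTelemetry spans) extend the same pattern to end-to-end tracing.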

There is much more that I discussed during my presentation, which you can go through. I hope this gives you enough detail to start bringing observability into your LLM-based applications.

You can watch the full presentation here.
