Evaluating generative AI performance: When your data is “anything”

Tal Reisfeld - Dec 11 '23 - Dev Community

Behind the scenes of evaluating the industry’s first generative AI observability assistant. To read the full New Relic blog, click here.


Our mission behind building New Relic AI was to unlock observability for all; to enable any user to ask their telemetry data anything. New Relic AI achieves this by letting customers interact with the New Relic database (NRDB) using natural language, reducing their reliance on New Relic Query Language (NRQL) proficiency, and streamlining analysis across their tech stacks. Additional capabilities include synthesizing insights from our documentation and knowledge base, surfacing system anomalies, and more.

Performance evaluation

Establishing robust evaluation measures and protocols is a key aspect of working with big data, where AI models thrive. With countless decisions to make, it's essential to evaluate them systematically and at scale. Traditional machine learning (ML) algorithms often have well-defined and established evaluation metrics to achieve this. For instance, in a vehicle classification problem, you could ask, "Did I correctly identify that car on the road?" or for a regression problem, "How far was my predicted number from the actual result?"

With the introduction of ChatGPT, OpenAI raised the bar on what generative AI can achieve. Its performance is such a leap forward from anything we’ve seen before that it has not only enhanced existing capabilities but also unlocked a whole new set of possibilities. Yet this new technology doesn't come without challenges, one of which is performance validation. While new generative AI features are emerging every day, standard evaluation protocols and metrics for measuring the underlying tasks have yet to be established. The unstructured nature of natural language output makes formulating these protocols a significant undertaking. We, the New Relic AI development team, paid close attention to this aspect, as it’s crucial for creating a truly trustworthy feature that delivers real value to users.

Our goal with New Relic AI was to create a tool that can answer anything you ask. As with any big data project, we need to evaluate the answers to make informed design decisions and optimize performance. But how can we determine if we got anything right? In this post, we share some of the measures we took to tackle this problem.

Strategy 1: Identifying areas where clear-cut measures of success can be applied

Although natural language is unstructured and generative AI-based systems like New Relic AI can be complex, we can break them down into components and identify parts of the system where clear-cut measures of success can be defined. The design of New Relic AI is modularized, and the manageable, independent components make the evaluation of the entire system more systematic. Throughout all stages of design and implementation, we consistently considered the importance of measurability and pinpointed specific areas where such measures could be applied.

Example 1: Classifying user requests by use case

One such area is the classification of user questions into different flows, which is the first stage of processing each user request. When you ask New Relic AI a question, it first decides which flow the question should go to: Is it a general question about New Relic (in which case, we'll search our documentation), a question about the user's own system (requiring an NRQL query), or a request for a general overview of errors in the system (calling the Lookout service)? Although this "routing" stage involves generative AI, it’s essentially a classification problem, which lets us use established evaluation metrics for classification. This is crucial for evaluating whether we're on the right track with our approach to the solution, and for scaling New Relic AI capabilities by adding more possible flows in the future.
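To make the classification framing concrete, here is a minimal sketch of how routing decisions could be scored against a hand-labeled set of questions. The flow labels, the function name, and the sample data are all hypothetical and chosen for illustration; the post does not say which metrics or tooling the team actually used.

```python
from collections import Counter

# Hypothetical labels for the three flows described above.
FLOWS = ["docs_search", "nrql_query", "lookout_overview"]


def per_flow_metrics(expected, predicted, flows=FLOWS):
    """Return ({flow: (precision, recall)}, overall_accuracy)."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    # Count each (expected, predicted) pair once.
    pairs = Counter(zip(expected, predicted))
    metrics = {}
    for flow in flows:
        true_pos = pairs[(flow, flow)]
        predicted_as_flow = sum(n for (_, p), n in pairs.items() if p == flow)
        actually_flow = sum(n for (e, _), n in pairs.items() if e == flow)
        precision = true_pos / predicted_as_flow if predicted_as_flow else 0.0
        recall = true_pos / actually_flow if actually_flow else 0.0
        metrics[flow] = (precision, recall)
    correct = sum(n for (e, p), n in pairs.items() if e == p)
    accuracy = correct / len(expected) if expected else 0.0
    return metrics, accuracy


# Made-up labels: the router sends one docs question to the NRQL flow.
expected = ["docs_search", "docs_search", "nrql_query", "nrql_query", "lookout_overview"]
predicted = ["docs_search", "nrql_query", "nrql_query", "nrql_query", "lookout_overview"]
metrics, accuracy = per_flow_metrics(expected, predicted)
```

Per-flow precision and recall show which flow is absorbing misrouted questions, which a single accuracy number would hide. That becomes more useful as new flows are added.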

Example 2: Syntax validation in NL2NRQL

Another example is translating natural language to NRQL (NL2NRQL), where we take user input in natural language and convert it into an NRQL query. New Relic AI is able to perform NL2NRQL, run the query against NRDB, fetch the data, and then explain the results. If you've used ChatGPT for programming help, you might have encountered syntax hallucinations, where a code block is perfectly structured, but some keyword, function, or attribute doesn't exist. It was essential for us to avoid this issue with NL2NRQL and provide users with valid queries that actually fetch their data. To achieve this, we test the validity of our generated NRQL by running it through our parser and compiler for syntax evaluation, then against NRDB to ensure it returns results. New Relic AI performs syntax validation continuously in real time and only returns queries that are syntactically correct. This offers a clear, binary measure of success for this aspect of the system, which is one of the reasons we wanted to develop around NRQL from the beginning.
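The validate-before-returning idea can be sketched as follows. This is an illustration under stated assumptions, not New Relic's implementation: `generate` and `is_valid` are hypothetical callables standing in for the language model and for the NRQL parser and compiler, the retry loop is an assumption (the post only says invalid queries are not returned), and `toy_is_valid` is a deliberately crude stand-in that does not check real NRQL syntax.

```python
from typing import Callable, Iterable, Optional


def first_valid_query(
    generate: Callable[[str], str],
    is_valid: Callable[[str], bool],
    question: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first generated query that passes validation, else None."""
    for _ in range(max_attempts):
        candidate = generate(question)
        if is_valid(candidate):
            return candidate
    return None


def validity_rate(queries: Iterable[str], is_valid: Callable[[str], bool]) -> float:
    """Binary success measure: fraction of queries that pass validation."""
    queries = list(queries)
    if not queries:
        return 0.0
    return sum(1 for q in queries if is_valid(q)) / len(queries)


def toy_is_valid(query: str) -> bool:
    """Crude stand-in for a real parser: checks only for SELECT ... FROM."""
    tokens = query.upper().split()
    return len(tokens) >= 4 and tokens[0] == "SELECT" and "FROM" in tokens


# A fake generator that hallucinates once before producing a usable query.
attempts = iter([
    "SELEC count(*) Transaction",
    "SELECT count(*) FROM Transaction SINCE TODAY",
])
query = first_valid_query(
    lambda q: next(attempts), toy_is_valid, "How many transactions did I have today?"
)
```

Because validation is a yes-or-no check, the same `is_valid` function can gate responses at request time and score a batch of generated queries offline with `validity_rate`.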

As a note, OpenAI recently adopted a similar approach with their new code interpreter plug-in, where they run their suggested code to ensure its correctness.

Strategy 2: Tackling hallucinations: Retrieval-augmented generation (RAG) and reframing the task

Whenever we mentioned our work with generative AI to colleagues or friends, they expressed a common concern: "What about hallucinations?"

Hallucinations refer to a known pitfall of these generative AI models, which is their tendency to make up facts with confidence. Essentially, the responses are always presented confidently regardless of whether they're accurate or correct (recall our syntax hallucinations example). Moreover, OpenAI's models have a knowledge cutoff limiting their built-in knowledge.

For New Relic AI (currently powered entirely by GPT-4), this issue is especially relevant when answering general or knowledge-based questions about New Relic. We aim to provide 100% accurate and up-to-date information in all of our responses.

Using RAG

Retrieval-augmented generation (RAG) is now an established common practice for dealing with hallucinations. It involves retrieving relevant information from an external database and incorporating it into the language model's prompt. The reasoning behind this is straightforward: if you're unsure whether your model's built-in knowledge is accurate for your use case (which is often the case), and since you know it doesn't contain the most recent facts, you can bring the facts to the model from an external source. With New Relic AI, that external source is our docs stored in a Pinecone vector database.
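The retrieve-then-prompt loop can be sketched in a few lines. Everything here is a simplified stand-in: the word-count "embedding" and in-memory list replace a learned embedding model and the Pinecone vector database, and the passages, function names, and prompt layout are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would use a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, passages: List[str], k: int = 2) -> List[str]:
    """Return the k passages most similar to the question."""
    q = embed(question)
    return sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]


def build_prompt(question: str, passages: List[str]) -> str:
    """Place the retrieved passages in the prompt ahead of the question."""
    context = "\n\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Hypothetical doc snippets, paraphrased from this post.
DOCS = [
    "NRQL is the New Relic Query Language used to query data in NRDB.",
    "NerdGraph is the New Relic GraphQL API.",
    "The Lookout service gives a general overview of errors in the system.",
]
question = "What is NerdGraph?"
top = retrieve(question, DOCS, k=1)
prompt = build_prompt(question, top)
```

The key design point is that the facts travel with the request: updating the docs changes what the model sees without retraining it.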

Reframing the task

Once we’ve brought relevant context into the model with RAG, we need to ensure the model actually relies on it instead of answering from built-in knowledge. This means adjusting how we present the user's question to the model when the question relates to general New Relic knowledge.

To illustrate this, let's use an example. Imagine you're given a task with specific instructions. In the first scenario, you're simply asked to answer a question like, "Who was the third president of the United States?" With this phrasing, it's reasonable to assume you're expected to answer using your prior knowledge. Now, imagine a second setting where you're asked the same question, but you're also given some text that might help you. In this case, we've employed RAG, and if our goal is to maximize response accuracy, this approach would likely yield better results. Finally, let’s take this a step further. Consider a third setting where your instructions are: “Here is some text; your task is to determine whether the answer to the question, ‘Who was the third president of the United States?’ can be found within it, and if so, provide an explanation.” The third setting is an entirely different task from the first, as it focuses on text comprehension rather than knowledge recall.

Since knowledge accuracy is a potential pitfall of generative AI (hallucinations), and summarization and text comprehension are its strengths, we phrase our prompt following the concept of the third setting.
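The three settings can be written as three prompt templates. The wording below is illustrative only and is not the prompt New Relic AI uses; the function names are hypothetical, and the explicit "say so" instruction for the no-answer case is an added assumption.

```python
def recall_prompt(question: str) -> str:
    """Setting 1: knowledge recall, with no supporting text."""
    return f"Answer the question: {question}"


def rag_prompt(question: str, text: str) -> str:
    """Setting 2: same question, with text that might help."""
    return f"Answer the question: {question}\n\nThis text might help:\n{text}"


def comprehension_prompt(question: str, text: str) -> str:
    """Setting 3: text comprehension, where the answer must come from the text."""
    return (
        "Here is some text. Your task is to determine whether the answer to "
        "the question can be found within it, and if so, provide an "
        "explanation. If the text does not contain the answer, say so.\n\n"
        f"Question: {question}\n\nText:\n{text}"
    )


question = "Who was the third president of the United States?"
text = "Thomas Jefferson served as the third president of the United States."
prompt = comprehension_prompt(question, text)
```

Only the third template makes "not found in the text" a valid outcome, which gives the model an alternative to falling back on its built-in knowledge.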

Strategy 3: Promoting trustworthiness via transparency and clear communication

An additional step we take to increase New Relic AI trustworthiness is providing users the information they need to evaluate the answers themselves. If we offer answers without context, we might leave users unsure of their validity. To address this, we emphasize clear communication throughout the process, adding intermediate steps at various stages of the Q&A flow.

For example, when users ask about their own data (for example, "How many transactions did I have today?"), New Relic AI first informs them it will convert their question into an NRQL query. This way, users know that we're using NRQL to answer their question and can already assess whether this approach makes sense. Next, the AI provides the generated query (for example, SELECT count(*) FROM Transaction SINCE TODAY), offers a visualization created by querying NerdGraph, our GraphQL API, with the generated query, and finally presents the answer based on the data returned from NRDB (for example, "16,443,606,861"). By sharing the same information used to derive the final answer, users can better understand and trust the response. Even if they're not proficient in NRQL, the query syntax and keywords are often self-explanatory, making it easier to comprehend.

Similarly, when the AI answers a knowledge question about New Relic, it first informs the user that it will search the docs and then provides links to the references it used alongside the answer.

Listening to feedback

Our users serve as the ultimate judges of New Relic AI capabilities, and we’re continuously collecting feedback. We're grateful to all external and internal users who have contributed their valuable insights.

If you're using New Relic AI, we'd love to hear your thoughts. Please continue to help us improve by rating its responses with the thumbs up/thumbs down buttons available on each response, or by submitting additional feedback via Help > Give Us Feedback on the left navigation panel.

