Implement LLM guardrails for RAG applications

Jill Amaya - Sep 12 - - Dev Community

By: Roy Derks

In the evolving world of AI and language models, ensuring that outputs are factually accurate and relevant is crucial. Developers often rely on foundation models to generate responses based on company data, but large language models (LLMs) can sometimes combine multiple pieces of information incorrectly, resulting in hallucinated responses that are either inaccurate or entirely fabricated.

In this tutorial, learn how to use the contextual grounding checks that come with the guardrails functionality in watsonx Flows Engine. With watsonx Flows Engine, you can build AI applications for several use cases, including retrieval augmented generation (RAG) applications. These checks are designed to detect hallucinations in responses, especially in RAG applications, where the model pulls data from various sources to craft its answers. By utilizing LLM guardrails, you can better identify responses that are factually incorrect or irrelevant to a user’s query, helping to maintain the reliability of AI-driven applications.

Contextual grounding for RAG

Contextual grounding in watsonx Flows Engine makes AI outputs more reliable by anchoring responses in accurate, relevant source data. By cross-referencing model outputs with the input and the relevant context from a vector database, the guardrails built into watsonx Flows Engine help detect hallucinations or fabrications, so that you can check whether responses are factually grounded. This is particularly important when using LLMs for tasks that demand high precision and credibility.

After guardrails are enabled, watsonx Flows Engine scores the input and output of a flow using three metrics or scores:

  • Answer relevance: This measures how closely the model’s output aligns with the input question. Scores range between 0 and 1, with higher scores indicating more relevant responses.

  • Context relevance: This metric assesses how well the context used in the response relates to the input. A score closer to 1 suggests that the context is highly relevant to the user's query.

  • Groundedness: Groundedness measures how well the response is anchored in the provided context. A high score means the response is solidly based on reliable sources, minimizing the risk of hallucination.

This helps you ensure that your applications provide outputs that are not only accurate but also contextually aligned with user queries. These checks enhance user trust in AI-driven applications by ensuring consistent, factually correct responses.

Deploying a RAG application with watsonx Flows Engine

To take advantage of guardrails, you need to deploy the RAG application first. For this, you’ll use watsonx Flows Engine, which lets you set up a complete RAG flow in a matter of minutes using the CLI. Using watsonx Flows Engine is completely free, and gives you access to (limited) LLM tokens for watsonx.ai and a Milvus vector database running in watsonx.data with no need to configure these connections yourself.

To build a RAG application using watsonx Flows Engine, follow these steps:

  1. Download the wxflows CLI: Install the CLI to interact with watsonx Flows Engine. For this, you must have Python installed on your local machine.

  2. Create an account: Sign up for a free account, signing in with your IBMid or GitHub.

  3. Set up your RAG application: Follow the Build a RAG application with watsonx.ai flows engine tutorial to configure your RAG application in just a few minutes.

After it's deployed, you have a vector database that is populated with data from watsonx documentation and an endpoint to interact with. To enable guardrails, you must modify your flow to include steps that measure hallucination and score responses based on the three key metrics.

Enabling guardrails in your flows

After setting up a new RAG application, you should have a wxflows.toml file on your machine that includes a set of flows. To activate guardrails, open the wxflows.toml file and include the following flow:

myRagWithGuardrails = ragAnswerInput | topNDocs | promptFromTopN | completion(parameters:myRagWithGuardrails.parameters) | ragScoreInfo | hallucinationScore | ragScoreMessage | ragInfo

In this flow there are three steps related to implementing the guardrails:

  • ragScoreInfo collects scoring data.
  • hallucinationScore evaluates the inputs and outputs for hallucinations.
  • ragScoreMessage provides messages related to hallucination risks.

To make this flow available, you must deploy the flows to your watsonx Flows Engine endpoint by running the following command:

wxflows deploy

The endpoint that the flows were deployed to is printed in your terminal. You need it in the next step to test the contextual grounding checks and hallucination detection.

The next section covers how to use the myRagWithGuardrails flow with either the JavaScript or Python SDK, together with an LLM that’s available on watsonx.ai. It assumes that you set up the connection to watsonx.ai by following the third step of the Build a RAG application with watsonx.ai flows engine guide from the previous section.

Use the JavaScript SDK for watsonx Flows Engine

You can use the JavaScript SDK (or Python SDK) to send a request to your watsonx Flows Engine endpoint. The advantage of the SDK over a "plain" HTTPS request is that it is easier to integrate watsonx Flows Engine into your projects. To use the JavaScript SDK, you need to have Node.js installed on your machine. Then follow these steps.

First, set up a new JavaScript project in a new directory.

npm init -y

In this new project, you must install the JavaScript SDK that’s available from npm by running the following command.

npm i wxflows

After the installation is complete, create a new file, for example, index.js, and paste the following code into it.

// Load the watsonx Flows Engine SDK
const wxflows = require('wxflows')

async function getAnswer() {
    // Replace both placeholders with the values for your own endpoint
    const model = new wxflows({
        endpoint: 'YOUR_WXFLOWS_ENDPOINT',
        apikey: 'YOUR_WXFLOWS_APIKEY',
    })

    const schema = await model.generate()

    // Run the flow that includes the guardrails steps
    const result = await model.flow({
        schema,
        flowName: 'myRagWithGuardrails',
        variables: {
            n: 5,
            question: 'What is watsonx.ai?',
            aiEngine: 'WATSONX',
            model: 'ibm/granite-13b-chat-v2',
            collection: 'watsonxdocs',
            parameters: {
                max_new_tokens: 400,
                temperature: 0.7,
            },
        },
    })

    // The flow output holds the answer, the scores, and the score message
    const response = result?.data?.myRagWithGuardrails?.out

    console.log('Response:', response?.modelResponse, 'Guardrails:', response?.hallucinationScore, 'Score:', response?.scoreMessage)
}

getAnswer()

Finally, to run this piece of JavaScript code, use the following command.

node index.js

The previous JavaScript code sends a request to your endpoint with the question "What is watsonx.ai?" The model that is used for text generation is granite-13b-chat-v2. The myRagWithGuardrails flow returns both the answer and the following scores:

"hallucinationScore": {
   "answer_relevance": 0.4280790388584137,
   "context_relevance": 0.8915192484855652,
   "groundedness": 0.8474840521812439
}

The context relevance and groundedness scores are both high, with no indication of hallucination: the groundedness score of about 0.85 is above the default limit of 0.80, so the response is considered grounded in the provided context. The answer relevance is lower, at about 0.43. You might get slightly different results when you run the previous commands on your own endpoint because LLMs are probabilistic and can return a different answer at any given moment.
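If you want to check the scores in code rather than by eye, a small helper can list the metrics that fall below a limit. This is a minimal sketch: metricsBelowLimit is a hypothetical function, not part of the wxflows SDK, and applying the same 0.80 limit to every metric is an assumption made for illustration.

```javascript
// Hypothetical helper, not part of the wxflows SDK.
// Returns the names of the metrics whose score is below `limit`.
function metricsBelowLimit(scores, limit = 0.8) {
    return Object.entries(scores)
        .filter(([, value]) => typeof value === 'number' && value < limit)
        .map(([name]) => name)
}

// Scores copied from the example output above.
const hallucinationScore = {
    answer_relevance: 0.4280790388584137,
    context_relevance: 0.8915192484855652,
    groundedness: 0.8474840521812439,
}

console.log('Below limit:', metricsBelowLimit(hallucinationScore))
```

In index.js, you could pass response?.hallucinationScore to such a helper instead of the hard-coded object.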

In the next section, you try out the same flow with a different LLM to see how this affects the scores.

Improving scores with different models and prompts

You can improve the quality of responses by changing the model or adjusting the prompt. For instance, switching to the Mistral Large model can result in a different response. Trying out different LLMs in watsonx Flows Engine is straightforward. To use a different model, the only thing you need to change is the model variable in the request that you send with the JavaScript SDK:

const result = await model.flow({
    schema,
    flowName: 'myRagWithGuardrails',
    variables: {
        n: 5,
        question: 'What is watsonx.ai?',
        aiEngine: 'WATSONX',
        model: 'mistralai/mistral-large',
        collection: 'watsonxdocs',
        parameters: {
            max_new_tokens: 400,
            temperature: 0.7,
        },
    },
})

Take a look at the Using Mistral AI LLMs in watsonx Flows Engine tutorial to learn more about the Mistral Large model.

Then, you can run the previous code with the following command.

node index.js

In your terminal, you can see the following results being printed.

"hallucinationScore": {
    "answer_relevance": 0.6511902213096619,
    "context_relevance": 0.8915192484855652,
    "groundedness": 0.8460367321968079
}

Here, the answer relevance has increased from about 0.43 to about 0.65, while the context relevance is unchanged and the groundedness score remains similar. These variations let you fine-tune the performance of your RAG applications by trying different models and seeing which model gives the best results for your data. Keep in mind that the results on your machine could be different because of the probabilistic nature of LLMs.
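When you compare several models, it can help to compute the per-metric difference between two runs. The following is a rough sketch using the two score sets shown in this tutorial; scoreDelta is a hypothetical helper, not an SDK function.

```javascript
// Hypothetical helper: per-metric difference (candidate minus baseline).
function scoreDelta(baseline, candidate) {
    const delta = {}
    for (const name of Object.keys(baseline)) {
        delta[name] = candidate[name] - baseline[name]
    }
    return delta
}

// Scores copied from the two example runs in this tutorial.
const graniteScores = {
    answer_relevance: 0.4280790388584137,
    context_relevance: 0.8915192484855652,
    groundedness: 0.8474840521812439,
}
const mistralScores = {
    answer_relevance: 0.6511902213096619,
    context_relevance: 0.8915192484855652,
    groundedness: 0.8460367321968079,
}

const delta = scoreDelta(graniteScores, mistralScores)
for (const [name, value] of Object.entries(delta)) {
    // A positive value means the second model scored higher on that metric
    console.log(name, value.toFixed(3))
}
```

Because a single run is subject to the variation described above, a comparison like this is more informative when repeated over several questions.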

Another way to improve the scores is by changing your prompt. You can use the prompt to tell the LLM to not hallucinate and only use the context provided to answer your question. When using the promptFromTopN step in your flow, these guidelines are automatically included in the prompt.
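To illustrate what such guidelines can look like, here is a minimal sketch of a prompt builder. The buildGroundedPrompt function and its wording are assumptions for illustration only; the actual template that promptFromTopN uses may differ.

```javascript
// Hypothetical prompt builder; the real promptFromTopN template may differ.
function buildGroundedPrompt(question, documents) {
    // Number each retrieved passage so the context is easy to read
    const context = documents
        .map((doc, index) => `[${index + 1}] ${doc}`)
        .join('\n')

    return [
        'Answer the question using only the context below.',
        'If the context does not contain the answer, say that you do not know.',
        '',
        'Context:',
        context,
        '',
        `Question: ${question}`,
        'Answer:',
    ].join('\n')
}

// Placeholder passages stand in for the documents retrieved from the vector database.
const prompt = buildGroundedPrompt('What is watsonx.ai?', [
    'First retrieved passage',
    'Second retrieved passage',
])
console.log(prompt)
```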

Handling low scores

Sometimes, a query might result in low scores, indicating a poor response. This could be the case when you ask a question that’s irrelevant to the data provided to the LLM or a question that’s outside the scope of the data that the LLM has been trained on.

Let’s look at the following example, where you ask the model to answer the question "How to implement a Fibonacci sequence in watsonx." Not only does the data set lack information on coding Fibonacci sequences, it’s also impossible to directly implement a Fibonacci sequence in watsonx, because it’s a platform for working with data and models, not a mathematical playground.

To try out this question, make the following change to the index.js file:

const result = await model.flow({
    schema,
    flowName: 'myRagWithGuardrails',
    variables: {
        n: 5,
        question: 'How to implement a Fibonacci sequence in watsonx',
        aiEngine: 'WATSONX',
        model: 'ibm/granite-13b-chat-v2',
        collection: 'watsonxdocs',
        parameters: {
            max_new_tokens: 400,
            temperature: 0.7,
        },
    },
})

This returns a score like the following:

"hallucinationScore": {
    "answer_relevance": 0.02507544681429863,
    "context_relevance": 0.9424963593482971,
    "groundedness": 0.09882719814777374
}

In this case, the answer relevance and groundedness are close to zero, signaling a clear hallucination. The flow even returns a warning: "LOW GROUNDEDNESS." Because watsonx is designed for AI tasks, not software development, such queries fall outside its scope and the answer isn’t grounded in the relevant context either.

By default, the groundedness limit is set to 0.80, but you can modify this limit in your flow based on your needs. This flexibility ensures that you can strike the right balance between precision and creativity when handling different types of queries.
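On the application side, you might also decide not to show a low-scoring answer to the user at all. The sketch below assumes the response shape read in index.js (modelResponse, hallucinationScore, scoreMessage); answerOrFallback and the fallback text are hypothetical, and the check runs in your own code rather than changing the limit inside the flow.

```javascript
// Hypothetical helper, not part of the wxflows SDK.
// `response` is assumed to have the shape read in index.js:
// { modelResponse, hallucinationScore, scoreMessage }
function answerOrFallback(response, groundednessLimit = 0.8) {
    const groundedness = response?.hallucinationScore?.groundedness

    // Treat a missing score as ungrounded rather than letting the answer through
    if (typeof groundedness !== 'number' || groundedness < groundednessLimit) {
        return {
            grounded: false,
            text: 'I could not find a grounded answer to that question in the available documents.',
        }
    }

    return { grounded: true, text: response.modelResponse }
}

// Scores copied from the Fibonacci example; the model output is omitted here.
const lowScoreResponse = {
    modelResponse: '(model output omitted)',
    hallucinationScore: {
        answer_relevance: 0.02507544681429863,
        context_relevance: 0.9424963593482971,
        groundedness: 0.09882719814777374,
    },
    scoreMessage: 'LOW GROUNDEDNESS',
}

const checked = answerOrFallback(lowScoreResponse)
console.log(checked.grounded, checked.text)
```

Passing a lower groundednessLimit makes the check more permissive, which mirrors the trade-off between precision and creativity described above.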

Conclusion

In this tutorial, you learned how to implement LLM guardrails in watsonx Flows Engine to ensure the reliability of your RAG applications. By leveraging contextual grounding checks, scoring metrics, and customizable models, you can fine-tune your flows to provide accurate, relevant, and grounded responses. Combined with the flexible, low-code nature of watsonx Flows Engine, this added layer of safety minimizes hallucinations, making AI-driven applications more trustworthy for end users.

Want to know more about LLM Guardrails or using watsonx Flows Engine? Join our Discord community and get in touch!
