Using Nuclia’s RAG Evaluation Tools

By Eudald Camprubi, posted on October 21, 2025

Previously published on Nuclia.com. Nuclia is now Progress Agentic RAG.

Nuclia’s RAG evaluation tools, REMi and nuclia-eval, can be used to identify and resolve issues in a failing RAG pipeline. The tools are based on the RAG Triad framework, which evaluates the query, contexts and answer in relation to each other using these metrics:

  • Answer Relevance: How relevant is the generated answer to the user query?
  • Context Relevance: How relevant is the retrieved context to the user query?
  • Groundedness: To what degree is the generated answer grounded in the retrieved context?

By analyzing these metrics, users can identify specific weaknesses in their RAG pipelines. Let’s walk through three case studies of failing RAG pipelines and see how REMi and nuclia-eval can be used to diagnose and resolve the issues.
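Before diving into the cases, here is a minimal sketch of scoring a single query/context/answer triple with nuclia-eval. It assumes the package is installed from PyPI as nuclia-eval and exposes a `REMi` evaluator with an `evaluate_rag(query, answer, contexts)` method; check the project’s README for the exact class names and return format of your installed version.

```python
# pip install nuclia-eval   (assumed package name; see the nuclia-eval README)
from nuclia_eval import REMi

# Instantiate the REMi evaluator (loads the REMi model locally)
evaluator = REMi()

# The Lyon bakery example from Case 1 below: we would expect high
# Answer Relevance but low Context Relevance and Groundedness.
query = "Which is the best bakery in Lyon?"
contexts = [
    "Lyon is recognized for its cuisine and gastronomy, as well as its "
    "historical and architectural landmarks.",
]
answer = "One of the most renowned bakeries in Lyon is Pralus, famous for its Praluline."

# The result holds the RAG Triad scores: answer relevance for the generated
# answer, plus context relevance and groundedness for each retrieved context.
result = evaluator.evaluate_rag(query=query, answer=answer, contexts=contexts)
print(result)
```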

Case 1: Unverifiable Claims

This scenario is characterized by high Answer Relevance but low Context Relevance and Groundedness: the generated answer is relevant to the query, but it is not supported by the retrieved context. The retrieval model is failing to find relevant information, so the LLM falls back on its own internal knowledge to generate the answer, which risks hallucinations or inaccurate information.

Example:

  • Query: Which is the best bakery in Lyon?
  • Context: Lyon is recognized for its cuisine and gastronomy, as well as its historical and architectural landmarks.
  • Generated Answer: One of the most renowned bakeries in Lyon is Pralus, famous for its Praluline.

Solutions:

  1. Change the RAG strategy: This could involve retrieving whole documents instead of paragraphs or adding visual information by including images in the context (see Nuclia’s retrieval strategies).
  2. Perform data augmentation: This involves enriching the knowledge base with more information, such as document summaries, metadata or image captions.
  3. Improve the retrieval components: This might involve adjusting the weighting of different indices (e.g., full-text search vs. semantic search) or adding new indices with different semantic models, as sketched after this list.
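To illustrate the third solution, here is a generic sketch of re-weighting a hybrid retriever that fuses full-text and semantic scores. The `fuse_scores` function and the weights are illustrative assumptions, not part of Nuclia’s API; in Progress Agentic RAG the equivalent tuning would be done through the platform’s retrieval settings rather than in application code.

```python
from typing import Dict

def fuse_scores(
    fulltext_scores: Dict[str, float],
    semantic_scores: Dict[str, float],
    fulltext_weight: float = 0.3,
    semantic_weight: float = 0.7,
) -> Dict[str, float]:
    """Weighted fusion of full-text (keyword) and semantic (vector) scores.

    Raising semantic_weight favors paraphrased matches; raising
    fulltext_weight favors exact keyword matches.
    """
    doc_ids = set(fulltext_scores) | set(semantic_scores)
    return {
        doc_id: fulltext_weight * fulltext_scores.get(doc_id, 0.0)
        + semantic_weight * semantic_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }

# Example: tilt the ranking toward semantic similarity when keyword matches
# keep surfacing generic passages (as in the Lyon bakery case above).
fused = fuse_scores(
    {"doc-a": 0.9, "doc-b": 0.2},   # full-text scores
    {"doc-a": 0.1, "doc-b": 0.8},   # semantic scores
    fulltext_weight=0.2,
    semantic_weight=0.8,
)
print(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))
```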

Case 2: Evasive Response

In this case, the RAG pipeline exhibits low Answer Relevance and Groundedness, but high Context Relevance. The retrieved context contains the necessary information to answer the query, but the LLM fails to generate a relevant answer. This suggests a problem with the LLM’s ability to understand and synthesize the information from the context.

Example:

  • Query: Which are the two official languages of Eswatini?
  • Context: Eswatini was a British protectorate until its independence in 1968. They maintain the official language established during that period by their colonizers, together with Swati, their native language.
  • Generated Answer: The context does not give enough information about the topic to answer the question.

Solutions:

  1. Use a more powerful LLM: A more sophisticated LLM might have better reasoning capabilities and be able to extract the answer from the context.
  2. Re-check the prompt templates: The prompts might be too restrictive, preventing the LLM from generating a relevant answer (see the sketch after this list).
  3. Adjust safety settings: If the LLM’s safety filters are too strict, they might be preventing it from generating an appropriate answer.
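To illustrate the second solution, here is an example of loosening an overly restrictive prompt template, using the Eswatini example above. Both templates are illustrative assumptions, not Nuclia’s built-in prompts.

```python
# An overly restrictive template can push the model toward refusals,
# even when the answer is derivable from the context.
RESTRICTIVE_TEMPLATE = """Answer ONLY if the context states the answer verbatim.
Otherwise reply: "The context does not give enough information."

Context: {context}
Question: {question}"""

# A relaxed template still requires grounding, but explicitly allows
# the model to combine facts stated in the context.
RELAXED_TEMPLATE = """Answer the question using only the context below.
You may combine and rephrase facts from the context to form the answer.
If the context truly contains no relevant information, say so.

Context: {context}
Question: {question}"""

prompt = RELAXED_TEMPLATE.format(
    context=(
        "Eswatini was a British protectorate until its independence in 1968. "
        "They maintain the official language established during that period "
        "by their colonizers, together with Swati, their native language."
    ),
    question="Which are the two official languages of Eswatini?",
)
print(prompt)
```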

Case 3: Unrelated Answers

This scenario is characterized by high Groundedness but low Answer and Context Relevance. The generated answer is grounded in the context, but both the answer and the context are irrelevant to the query. This indicates that the retrieval model is failing to retrieve relevant context, and the LLM is generating an answer based on irrelevant information.

Example:

  • Query: What is the best cafe in Amsterdam?
  • Context: Leading the charts of the best coffee shops in Amsterdam is the Green House.
  • Generated Answer: The Green House coffee shop is a must-visit in Amsterdam.

Solutions:

This issue requires addressing both the retrieval model and the LLM.

  1. Improve the retrieval model using the solutions outlined in Case 1.
  2. Engineer the prompt template to encourage the LLM to consider the query when generating the answer. This might involve instructing the LLM to only use relevant information or to avoid generating an answer if the context is irrelevant.
  3. Confirm the LLM isn’t forced to generate an answer. The prompt should allow the LLM to indicate if it doesn’t have enough information.
  4. Clearly separate the context and query in the prompt. This can be done using separators such as Markdown headings, tags or code blocks, as in the sketch after this list.
  5. Use a more powerful LLM.
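Tying solutions 2 through 4 together, here is an illustrative prompt template that separates the context from the query with Markdown-style headings and explicitly allows the model to decline when the context is irrelevant. The wording is an assumption for demonstration purposes, not Nuclia’s default prompt.

```python
PROMPT_TEMPLATE = """Answer the user question using only the retrieved context.
Answer only if the context is relevant to the question; if it is unrelated
or insufficient, say that you do not have enough information instead of
forcing an answer.

### Context
{context}

### Question
{question}"""

# The Amsterdam example from Case 3: the separators make it explicit that
# the coffee-shop passage is context, and the instructions let the model
# decline rather than produce an ungrounded but irrelevant answer.
prompt = PROMPT_TEMPLATE.format(
    context="Leading the charts of the best coffee shops in Amsterdam is the Green House.",
    question="What is the best cafe in Amsterdam?",
)
print(prompt)
```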

By using these evaluation tools and solutions, developers can improve the accuracy and reliability of their RAG pipelines. Want to try it yourself? Create your Progress Agentic account at rag.progress.cloud.

